Abstract. We study sequential decision making in environments where
rewards are only partially observed, but can be modeled as a function
of observed contexts and the action chosen by the decision maker. This
setting, known as contextual bandits, encompasses a wide variety of
applications such as health care, content recommendation and Internet
advertising. A central task is evaluation of a new policy given historic
data consisting of contexts, actions and received rewards. The key
challenge is that the past data typically does not faithfully represent
proportions of actions taken by a new policy. Previous approaches rely
either on models of rewards or models of the past policy. The former
are plagued by a large bias whereas the latter have a large variance.

In this work, we leverage the strengths and overcome the weaknesses
of the two approaches by applying the doubly robust estimation
technique to the problems of policy evaluation and optimization. We prove
that this approach yields accurate value estimates when we have
either a good (but not necessarily consistent) model of rewards or a
good (but not necessarily consistent) model of past policy. Extensive
empirical comparison demonstrates that the doubly robust estimation
uniformly improves over existing techniques, achieving both lower
variance in value estimation and better policies. As such, we expect the
doubly robust approach to become common practice in policy evaluation
and optimization.

Key words and phrases: Contextual bandits, doubly robust estimators,
causal inference.


1. INTRODUCTION 

Contextual bandits (Auer et al., 2002/03; Langford and Zhang, 2008),
sometimes known as associative reinforcement learning (Barto and
Anandan, 1985), are a natural generalization of the classic multiarmed
bandits introduced by Robbins (1952). In a contextual bandit problem,
the decision maker observes contextual information, based on which an
action is chosen out of a set of candidates; in return, a numerical
“reward” signal is observed for the chosen action, but not for others.
The process repeats for multiple steps, and the goal of the decision
maker is to maximize the total rewards in this process. Usually,
contexts observed by the decision maker provide useful information to
infer the expected reward of each action, thus allowing greater rewards
to be accumulated, compared to standard multi-armed bandits, which take
no account of the context.

Many problems in practice can be modeled by contextual bandits. For
example, in one type of Internet advertising, the decision maker (such
as a website) dynamically selects which ad to display to a user who
visits the page, and receives a payment from the advertiser if the user
clicks on the ad (e.g., Chapelle and Li, 2012). In this case, the
context can be the user’s geographical information, the action is the
displayed ad and the reward is the payment. Importantly, we find only
whether a user clicked on the presented ad, but receive no information
about the ads that were not presented.

Another example is content recommendation on Web portals (Agarwal et
al., 2013). Here, the decision maker (the web portal) selects, for each
user visit, what content (e.g., news, images, videos and music) to
display on the page. A natural objective is to “personalize” the
recommendations, so that the number of clicks is maximized (Li et al.,
2010). In this case, the context is the user’s interests in different
topics, either self-reported by the user or inferred from the user
browsing history; the action is the recommended item; the reward can be
defined as 1 if the user clicks on an item, and 0 otherwise.

Similarly, in health care, we only find out the clinical outcome (the
reward) of a patient who received a treatment (action), but not the
outcomes for alternative treatments. In general, the treatment strategy
may depend on the context of the patient such as her health level and
treatment history. Therefore, contextual bandits can also be a natural
model to describe personalized treatments.

The behavior of a decision maker in contextual bandits can be described
as a policy, to be defined precisely in the next sections. Roughly
speaking, a policy is a function that maps the decision maker’s past
observations and the contextual information to a distribution over the
actions. This paper considers the offline version of contextual
bandits: we assume access to historical data, but no ability to gather
new data (Langford, Strehl and Wortman, 2008; Strehl et al., 2011).
There are two related tasks that arise in this setting: policy
evaluation and policy optimization. The goal of policy evaluation is to
estimate the expected total reward of a given policy. The goal of
policy optimization is to obtain a policy that (approximately)
maximizes expected total rewards. The focus of this paper is on policy
evaluation, but as we will see in the experiments, the ideas can also
be applied to policy optimization. The offline version of contextual
bandits is important in practice. For instance, it allows a website to
estimate, from historical log data, how much gain in revenue can be
achieved by changing the ad-selection policy to a new one (Bottou et
al., 2013). Therefore, the website does not have to experiment on real
users to test a new policy, which can be very expensive and
time-consuming. Finally, we note that this problem is a special case of
off-policy reinforcement learning (Precup, Sutton and Singh, 2000).

Two kinds of approaches address offline policy evaluation. The first,
called the direct method (DM), estimates the reward function from given
data and uses this estimate in place of actual reward to evaluate the
policy value on a set of contexts. The second kind, called inverse
propensity score (IPS) (Horvitz and Thompson, 1952), uses importance
weighting to correct for the incorrect proportions of actions in the
historic data. The first approach requires an accurate model of
rewards, whereas the second approach requires an accurate model of the
past policy. In general, it might be difficult to accurately model
rewards, so the first assumption can be too restrictive. On the other
hand, in many applications, such as advertising, Web search and content
recommendation, the decision maker has substantial, and possibly
perfect, knowledge of the past policy, so the second approach can be
applied. However, it often suffers from large variance, especially when
the past policy differs significantly from the policy being evaluated.

In this paper, we propose to use the technique of doubly robust (DR)
estimation to overcome problems with the two existing approaches.
Doubly robust (or doubly protected) estimation (Cassel, Särndal and
Wretman, 1976; Robins, Rotnitzky and Zhao, 1994; Robins and Rotnitzky,
1995; Lunceford and Davidian, 2004; Kang and Schafer, 2007) is a
statistical approach for estimation from incomplete data with an
important property: if either one of the two estimators (i.e., DM or
IPS) is correct, then the estimation is unbiased. This method thus
increases the chances of drawing reliable inference.

We apply the doubly robust technique to policy evaluation and
optimization in a contextual bandit setting. The most straightforward
policies to consider are stationary policies, whose actions depend on
the current, observed context alone. Nonstationary policies, on the
other hand, map the current context and a history of past rounds to an
action. They are of critical interest because online learning
algorithms (also known as adaptive allocation rules), by definition,
produce nonstationary policies. We address both stationary and
nonstationary policies in this paper.

In Section 2, we describe previous work and connect our setting to the
related area of dynamic treatment regimes.

In Section 3, we study stationary policy evaluation, analyzing the bias
and variance of our core technique. Unlike previous theoretical
analyses, we do not assume that either the reward model or the past
policy model is correct. Instead, we show how the deviations of the two
models from the truth impact bias and variance of the doubly robust
estimator. To our knowledge, this style of analysis is novel and may
provide insights into doubly robust estimation beyond the specific
setting studied here. In Section 4, we apply this method to both policy
evaluation and optimization, finding that this approach can
substantially sharpen existing techniques.

In Section 5, we consider nonstationary policy evaluation. The main
approach here is to use the historic data to obtain a sample of the run
of an evaluated nonstationary policy via rejection sampling (Li et al.,
2011). We combine the doubly robust technique with an improved form of
rejection sampling that makes better use of data at the cost of small,
controllable bias. Experiments in Section 6 suggest the combination is
able to extract more information from data than existing approaches.

2. PRIOR WORK 
2.1 Doubly Robust Estimation 

Doubly robust estimation is widely used in statistical inference (see,
e.g., Kang and Schafer, 2007, and the references therein). More
recently, it has been used in Internet advertising to estimate the
effects of new features for online advertisers (Lambert and Pregibon,
2007; Chan et al., 2010). Most previous analysis of doubly robust
estimation focuses on asymptotic behavior or relies on various modeling
assumptions (e.g., Robins, Rotnitzky and Zhao, 1994; Lunceford and
Davidian, 2004; Kang and Schafer, 2007). Our analysis is nonasymptotic
and makes no such assumptions.

Several papers in machine learning have used ideas related to the basic
technique discussed here, although not with the same language. For
benign bandits, Hazan and Kale (2009) construct algorithms which use
reward estimators to improve regret bounds when the variance of actual
rewards is small. Similarly, the Offset Tree algorithm (Beygelzimer and
Langford, 2009) can be thought of as using a crude reward estimate for
the “offset.” The algorithms and estimators described here are
substantially more sophisticated.

Our nonstationary policy evaluation builds on the rejection sampling
approach, which has been previously shown to be effective (Li et al.,
2011). Relative to this earlier work, our nonstationary results take
advantage of the doubly robust technique and a carefully introduced
bias/variance tradeoff to obtain an empirical order-of-magnitude
improvement in evaluation quality.

2.2 Dynamic Treatment Regimes 

Contextual bandit problems are closely related to dynamic treatment
regime (DTR) estimation/optimization in medical research. A DTR is a
set of (possibly randomized) rules that specify what treatment to
choose, given current characteristics (including past treatment history
and outcomes) of a patient. In the terminology of the present paper,
the patient’s current characteristics are contextual information, a
treatment is an action, and a DTR is a policy. Similar to contextual
bandits, the quantity of interest in DTR can be expressed by a numeric
reward signal related to the clinical outcome of a treatment. We
comment on similarities and differences between DTR and contextual
bandits in more detail in later sections of the paper, where we define
our setting more formally. Here, we make a few higher-level remarks.

Due to ethical concerns, research in DTR is often performed with
observational data rather than on patients. This corresponds to the
offline version of contextual bandits, which only has access to past
data but no ability to gather new data. Causal inference techniques
have been studied to estimate the mean response of a given DTR (e.g.,
Robins, 1986; Murphy, van der Laan and Robins, 2001), and to optimize
DTR (e.g., Murphy, 2003; Orellana, Rotnitzky and Robins, 2010). These
two problems correspond to evaluation and optimization of policies in
the present paper.

In DTR, however, a treatment typically exhibits a 
long-term effect on a patient’s future “state,” while 
in contextual bandits the contexts are drawn IID 
with no dependence on actions taken previously. 
Such a difference turns out to enable statistically 
more efficient estimators, which will be explained in 
greater detail in Section 5.2. 

Despite these differences, as we will see later, contextual bandits and
DTR share many similarities, and in some cases are almost identical.
For example, analogous to the results introduced in this paper, doubly
robust estimators have been applied to DTR estimation (Murphy, van der
Laan and Robins, 2001), and also used as a subroutine for optimization
in a family of parameterized policies (Zhang et al., 2012). The
connection suggests a broader applicability of DTR techniques beyond
the medical domain, for instance, to the Internet-motivated problems
studied in this paper.

3. EVALUATION OF STATIONARY POLICIES 
3.1 Problem Definition 

We are interested in the contextual bandit setting 
where on each round: 

1. A vector of covariates (or a context) $x \in \mathcal{X}$ is
revealed.

2. An action (or arm) $a$ is chosen from a given set $\mathcal{A}$.

3. A reward $r \in [0,1]$ for the action $a$ is revealed, but
the rewards of other actions are not. The reward
may depend stochastically on $x$ and $a$.

We assume that contexts are chosen IID from an unknown distribution
$D(x)$, the actions are chosen from a finite (and typically not too
large) action set $\mathcal{A}$, and the distribution over rewards
$D(r \mid a, x)$ does not change over time (but is unknown).

The input data consists of a finite stream of triples $(x_k, a_k, r_k)$
indexed by $k = 1, 2, \ldots, n$. We assume that the actions $a_k$ are
generated by some past (possibly nonstationary) policy, which we refer
to as the exploration policy. The exploration history up to round $k$
is denoted

\[ z_k = (x_1, a_1, r_1, \ldots, x_k, a_k, r_k). \]

Histories are viewed as samples from a probability measure $\mu$. Our
assumptions about data generation then translate into the assumption
about factoring of $\mu$ as

\[ \mu(x_k, a_k, r_k \mid z_{k-1})
   = D(x_k)\,\mu(a_k \mid x_k, z_{k-1})\,D(r_k \mid x_k, a_k), \]

for any $k$. Note that apart from the unknown distribution $D$, the
only degree of freedom above is $\mu(a_k \mid x_k, z_{k-1})$, that is,
the unknown exploration policy.

When $z_{k-1}$ is clear from the context, we use a shorthand $\mu_k$
for the conditional distribution over the $k$th triple

\[ \mu_k(x, a, r) = \mu(x_k = x, a_k = a, r_k = r \mid z_{k-1}). \]

We also write $\mathbf{P}_k^{\mu}$ and $\mathbf{E}_k^{\mu}$ for
$\mathbf{P}_{\mu}[\,\cdot \mid z_{k-1}]$ and
$\mathbf{E}_{\mu}[\,\cdot \mid z_{k-1}]$.

Given input data $z_n$, we study the stationary policy evaluation
problem. A stationary randomized policy $\nu$ is described by a
conditional distribution $\nu(a \mid x)$ of choosing an action on each
context. The goal is to use the history $z_n$ to estimate the value of
$\nu$, namely, the expected reward obtained by following $\nu$:

\[ V(\nu) = \mathbf{E}_{x \sim D}\,
            \mathbf{E}_{a \sim \nu(\cdot \mid x)}\,
            \mathbf{E}_{r \sim D(\cdot \mid x, a)}[r]. \]

In content recommendation on Web portals, for example, $V(\nu)$
measures the average click probability per user visit, one of the major
metrics with critical business importance.
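
As a concrete illustration of the value of a policy, the following minimal Python sketch evaluates the nested expectation exactly for a small finite problem. The context names, action names, and every probability below are hypothetical and chosen only for illustration; in practice the context distribution and the expected rewards are unknown, which is why the estimators discussed later are needed.

```python
def policy_value(context_probs, policy, mean_reward):
    """Exact value of a stationary policy on a finite problem:
    sum_x D(x) * sum_a nu(a|x) * E[r | x, a]."""
    return sum(
        px * sum(policy[x][a] * mean_reward[x][a] for a in policy[x])
        for x, px in context_probs.items()
    )

# Hypothetical toy problem: two contexts, two actions.
D = {"young": 0.6, "old": 0.4}                        # context distribution
nu = {"young": {"ad1": 0.9, "ad2": 0.1},              # target policy nu(a|x)
      "old":   {"ad1": 0.2, "ad2": 0.8}}
r_star = {"young": {"ad1": 0.5, "ad2": 0.1},          # expected rewards
          "old":   {"ad1": 0.2, "ad2": 0.4}}

V = policy_value(D, nu, r_star)
print(round(V, 6))
```

With click/no-click rewards, the printed number would be the average click probability per visit under the target policy.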

In order to have unbiased policy evaluation, we make a standard
assumption that if $\nu(a \mid x) > 0$ then $\mu_k(a \mid x) > 0$ for
all $k$ (and all possible histories $z_{k-1}$). This clearly holds for
instance if $\mu_k(a \mid x) > 0$ for all $a$. Since $\nu$ is fixed in
our paper, we will write $V$ for $V(\nu)$. To simplify notation, we
extend the conditional distribution $\nu$ to a distribution over
triples $(x, a, r)$

\[ \nu(x, a, r) = D(x)\,\nu(a \mid x)\,D(r \mid a, x) \]

and hence $V = \mathbf{E}_{\nu}[r]$.

The problem of stationary policy evaluation, defined above, is slightly
more general than DTR analysis in a typical cross-sectional
observational study, where the exploration policy (known as “treatment
mechanism” in the DTR literature) is stationary; that is, the
conditional distribution $\mu(a_k \mid x_k, z_{k-1})$ is independent of
$z_{k-1}$ and identical across all $k$, that is, $\mu_k = \mu_1$ for
all $k$.

3.2 Existing Approaches

The key challenge in estimating policy value in contextual bandits is
that rewards are partially observable: in each round, only the reward
for the chosen action is revealed; we do not know what the reward would
have been if we chose a different action. Hence, the data collected in
a contextual bandit process cannot be used directly to estimate a new
policy’s value: if in a context $x$ the new policy selects an action
$a'$ different from the action $a$ chosen during data collection, we
simply do not have the reward signal for $a'$.

There are two common solutions for overcoming this limitation (see,
e.g., Lambert and Pregibon, 2007, for an introduction to these
solutions). The first, called the direct method (DM), forms an estimate
$\hat{r}(x, a)$ of the expected reward conditioned on the context and
action. The policy value is then estimated by

\[ \hat{V}_{\mathrm{DM}} = \frac{1}{n} \sum_{k=1}^{n}
   \sum_{a \in \mathcal{A}} \nu(a \mid x_k)\,\hat{r}(x_k, a). \]

Clearly, if $\hat{r}(x, a)$ is a good approximation of the true
expected reward $\mathbf{E}_D[r \mid x, a]$, then the DM estimate is
close to $V$. A problem with this method is that the estimate $\hat{r}$
is typically formed without the knowledge of $\nu$, and hence might
focus on approximating expected reward in the areas that are irrelevant
for $\nu$ and not sufficiently in the areas that are important for
$\nu$ (see, e.g., the analysis of Beygelzimer and Langford, 2009).
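
The direct method can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `dm_estimate`, the callable signatures `policy(a, x)` and `r_hat(x, a)`, and the toy numbers are all assumptions made for the example.

```python
def dm_estimate(contexts, actions, policy, r_hat):
    """Direct-method estimate: average over logged contexts x_k of
    sum_a nu(a|x_k) * r_hat(x_k, a). Logged actions and rewards are
    not used here; they only matter for fitting r_hat beforehand."""
    total = 0.0
    for x in contexts:
        total += sum(policy(a, x) * r_hat(x, a) for a in actions)
    return total / len(contexts)

# Hypothetical toy inputs.
actions = [0, 1]
policy = lambda a, x: 0.75 if a == x else 0.25   # target policy nu(a|x)
r_hat = lambda x, a: 0.8 if a == x else 0.2      # assumed fitted reward model
contexts = [0, 1, 1, 0]                          # logged contexts x_1..x_n

v_dm = dm_estimate(contexts, actions, policy, r_hat)
print(round(v_dm, 6))
```

Any error in the assumed reward model carries over directly into the estimate, which is the source of the bias discussed above.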

The second approach, called inverse propensity score (IPS), is
typically less prone to problems with bias. Instead of approximating
the reward, IPS forms an approximation $\hat{\mu}_k(a \mid x)$ of
$\mu_k(a \mid x)$, and uses this estimate to correct for the shift in
action proportions between the exploration policy and the new policy:

\[ \hat{V}_{\mathrm{IPS}} = \frac{1}{n} \sum_{k=1}^{n}
   \frac{\nu(a_k \mid x_k)}{\hat{\mu}_k(a_k \mid x_k)}\, r_k. \]

If $\hat{\mu}_k(a \mid x) \approx \mu_k(a \mid x)$, then the IPS
estimate above will be, approximately, an unbiased estimate of $V$.
Since we typically have a good (or even accurate) understanding of the
data-collection policy, it is often easier to obtain good estimates
$\hat{\mu}_k$, and thus the IPS estimator is in practice less
susceptible to problems with bias compared with the direct method.
However, IPS typically has a much larger variance, due to the increased
range of the random variable
$\nu(a_k \mid x_k) / \hat{\mu}_k(a_k \mid x_k)$. The issue becomes more
severe when $\hat{\mu}_k(a_k \mid x_k)$ gets smaller in high
probability areas under $\nu$. Our approach alleviates the large
variance problem of IPS by taking advantage of the estimate $\hat{r}$
used by the direct method.
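
A minimal Python sketch of the IPS estimate follows. It assumes, purely for illustration, that each logged record carries a propensity value `p` standing in for the estimated probability of the logged action; the record layout, the function name, and the toy numbers are hypothetical.

```python
def ips_estimate(log, policy):
    """Inverse-propensity-score estimate.
    log: list of (x, a, r, p), where p approximates the probability
    with which the exploration policy chose action a in context x."""
    return sum(policy(a, x) / p * r for x, a, r, p in log) / len(log)

# Hypothetical toy inputs: uniform exploration over two actions.
policy = lambda a, x: 0.75 if a == x else 0.25   # target policy nu(a|x)
log = [(0, 0, 1.0, 0.5), (1, 0, 0.0, 0.5),
       (1, 1, 1.0, 0.5), (0, 1, 0.0, 0.5)]

v_ips = ips_estimate(log, policy)
print(v_ips)

# A small propensity produces a large weight, which drives the variance.
rare = [(0, 0, 1.0, 0.01)]
print(ips_estimate(rare, policy))
```

The second call shows how a single record with a small propensity can dominate the estimate.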

3.3 Doubly Robust Estimator 

Doubly robust estimators take advantage of both the estimate of the
expected reward $\hat{r}$ and the estimate of action probabilities
$\hat{\mu}_k(a \mid x)$. A similar idea has been suggested earlier by a
number of authors for different estimation problems (Cassel, Särndal
and Wretman, 1976; Rotnitzky and Robins, 1995; Robins and Rotnitzky,
1995; Murphy, van der Laan and Robins, 2001; Robins, 1998). For the
setting in this section, the estimator of Murphy, van der Laan and
Robins (2001) can be reduced to

\[ \hat{V}_{\mathrm{DR}} = \frac{1}{n} \sum_{k=1}^{n} \left[
   \hat{r}(x_k, \nu)
   + \frac{\nu(a_k \mid x_k)}{\hat{\mu}_k(a_k \mid x_k)}
     \cdot \bigl(r_k - \hat{r}(x_k, a_k)\bigr) \right], \tag{3.1} \]

where

\[ \hat{r}(x, \nu) = \sum_{a \in \mathcal{A}} \nu(a \mid x)\,\hat{r}(x, a) \]

is the estimate of $\mathbf{E}_{\nu}[r \mid x]$ derived from $\hat{r}$.
Informally, the doubly robust estimator uses $\hat{r}$ as a baseline
and if there is data available, a correction is applied. We will see
that our estimator is unbiased if at least one of the estimators,
$\hat{r}$ and $\hat{\mu}_k$, is accurate, hence the name doubly robust.
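
The baseline-plus-correction structure of the doubly robust estimate can be sketched as follows. This is an illustrative sketch under stated assumptions: the record layout `(x, a, r, p)` with `p` standing in for the estimated propensity, the function name, and the toy values are not taken from the paper.

```python
def dr_estimate(log, actions, policy, r_hat):
    """Doubly robust estimate: for each record, start from the model
    baseline sum_b nu(b|x) * r_hat(x, b) and add an importance-weighted
    correction based on the observed reward."""
    total = 0.0
    for x, a, r, p in log:
        baseline = sum(policy(b, x) * r_hat(x, b) for b in actions)
        correction = policy(a, x) / p * (r - r_hat(x, a))
        total += baseline + correction
    return total / len(log)

# Hypothetical toy inputs.
actions = [0, 1]
policy = lambda a, x: 0.75 if a == x else 0.25   # target policy nu(a|x)
r_hat = lambda x, a: 0.8 if a == x else 0.2      # assumed reward model
log = [(0, 0, 1.0, 0.5), (1, 0, 0.0, 0.5),
       (1, 1, 1.0, 0.5), (0, 1, 0.0, 0.5)]

v_dr = dr_estimate(log, actions, policy, r_hat)
print(round(v_dr, 6))
```

When the reward model predicts the observed reward well, the correction term is small, so large importance weights have less influence than in plain IPS.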

In practice, quite often neither the estimate of
$\mathbf{E}_D[r \mid x, a]$ nor the estimate of $\mu_k$ is accurate. It
should be noted that, although $\mu_k$ tends to be much easier to
estimate than $\mathbf{E}_D[r \mid x, a]$ in applications that motivate
this study, it is rare to be able to get a perfect estimator, due to
engineering constraints in complex applications like Web search and
Internet advertising. Thus, a basic question is: How does the estimator
$\hat{V}_{\mathrm{DR}}$ perform as the estimates $\hat{r}$ and
$\hat{\mu}_k$ deviate from the truth? The following section analyzes
bias and variance of the DR estimator as a function of errors in
$\hat{r}$ and $\hat{\mu}_k$. Note that our DR estimator encompasses DM
and IPS as special cases (by respectively setting
$\hat{\mu}_k \equiv \infty$ and $\hat{r} \equiv 0$), so our analysis
also encompasses DM and IPS.
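
The two special cases can be checked numerically. The sketch below is an illustration under assumptions: it uses a hypothetical record layout `(x, a, r, p)`, represents an infinite propensity estimate with `math.inf`, and uses made-up toy data.

```python
import math

def dr_estimate(log, actions, policy, r_hat):
    """Doubly robust estimate over records (x, a, r, p)."""
    total = 0.0
    for x, a, r, p in log:
        baseline = sum(policy(b, x) * r_hat(x, b) for b in actions)
        total += baseline + policy(a, x) / p * (r - r_hat(x, a))
    return total / len(log)

actions = [0, 1]
policy = lambda a, x: 0.75 if a == x else 0.25   # target policy nu(a|x)
r_hat = lambda x, a: 0.8 if a == x else 0.2      # assumed reward model
log = [(0, 0, 1.0, 0.5), (1, 0, 0.0, 0.5),
       (1, 1, 1.0, 0.5), (0, 1, 0.0, 0.5)]

# Infinite propensity estimate: the correction vanishes, leaving DM.
log_inf = [(x, a, r, math.inf) for x, a, r, _ in log]
v_as_dm = dr_estimate(log_inf, actions, policy, r_hat)

# Zero reward model: the baseline vanishes, leaving IPS.
v_as_ips = dr_estimate(log, actions, policy, lambda x, a: 0.0)

print(round(v_as_dm, 6), round(v_as_ips, 6))
```

Because both estimators fall out of the same function, any analysis of the general form applies to them as well.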
3.4 Analysis

We assume that $\hat{r}(x, a) \in [0,1]$ and
$\hat{\mu}_k(a \mid x) \in (0, \infty]$, but in general $\hat{\mu}_k$
does not need to represent conditional probabilities (our notation is
only meant to indicate that $\hat{\mu}_k$ estimates $\mu_k$, but no
probabilistic structure). In general, we allow $\hat{r}$ and
$\hat{\mu}_k$ to be random variables, as long as they satisfy the
following independence assumptions:

• $\hat{r}$ is independent of $z_n$.

• $\hat{\mu}_k$ is conditionally independent of
$\{(x_\ell, a_\ell, r_\ell)\}_{\ell \ge k}$, conditioned on $z_{k-1}$.

The first assumption means that $\hat{r}$ can be assumed fixed and
determined before we see the input data $z_n$, for example, by
initially splitting the input dataset and using the first part to
obtain $\hat{r}$ and the second part to evaluate the policy. In our
analysis, we condition on $\hat{r}$ and ignore any randomness in its
choice.

The second assumption means that $\hat{\mu}_k$ is not allowed to depend
on the future. A simple way to satisfy this assumption is to split the
dataset to form an estimator (and potentially also include data
$z_{k-1}$). If we have some control over the exploration process, we
might also have access to “perfect logging”, that is, recorded
probabilities $\mu_k(a_k \mid x_k)$. With perfect logging, we can
achieve $\hat{\mu}_k = \mu_k$ respecting our assumptions.²

Analogous to $\hat{r}(x, a)$, we define the population quantity
$r^*(x, a)$

\[ r^*(x, a) = \mathbf{E}_D[r \mid x, a], \]

and define $r^*(x, \nu)$ similarly to $\hat{r}(x, \nu)$:

\[ r^*(x, \nu) = \mathbf{E}_{\nu}[r \mid x]. \]

Let $\Delta(x, a)$ and $\varrho_k(x, a)$ denote, respectively, the
additive error of $\hat{r}$ and the multiplicative error of
$\hat{\mu}_k$:

\[ \Delta(x, a) = \hat{r}(x, a) - r^*(x, a), \]

\[ \varrho_k(x, a) = \mu_k(a \mid x) / \hat{\mu}_k(a \mid x). \]

We assume that for some $M > 0$, with probability one under $\mu$:

\[ \nu(a_k \mid x_k) / \hat{\mu}_k(a_k \mid x_k) \le M, \]

which can always be satisfied by enforcing $\hat{\mu}_k \ge 1/M$.


² As we will see later in the paper, in order to reduce the
variance of the estimator it might still be advantageous to use
a slightly inflated estimator, for example, $\hat{\mu}_k = c\mu_k$ for
$c > 1$, or $\hat{\mu}_k(a \mid x) = \max\{c, \mu_k(a \mid x)\}$ for
some $c > 0$.
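
One simple way to enforce the lower bound on the propensity estimate is to clip it from below. The sketch here is only an illustration: the function name, the choice of threshold, and the sample values are assumptions, and clipping of this kind generally trades some bias for a bounded importance weight.

```python
def clipped_propensity(mu_hat, M):
    """Return max(mu_hat, 1/M). Since the target policy assigns
    probability at most 1 to any action, the importance weight
    nu / clipped value is then at most M."""
    return max(mu_hat, 1.0 / M)

M = 20.0                                  # hypothetical weight cap
raw = [0.5, 0.04, 0.001]                  # hypothetical propensity estimates
clipped = [clipped_propensity(p, M) for p in raw]

# Worst case over target probabilities nu(a|x) <= 1.
worst_weight = max(1.0 / p for p in clipped)
print(clipped, worst_weight)
```

Larger propensity estimates pass through unchanged, so only the rare actions are affected.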


To bound the error of $\hat{V}_{\mathrm{DR}}$, we first analyze a
single term:

\[ V_k = \hat{r}(x_k, \nu)
   + \frac{\nu(a_k \mid x_k)}{\hat{\mu}_k(a_k \mid x_k)}
     \cdot \bigl(r_k - \hat{r}(x_k, a_k)\bigr). \]

We bound its range, bias, and conditional variance
as follows (for proofs, see Appendix A):

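Lemma 3.1. The range of $V_k$ is bounded as

\[ |V_k| \le 1 + M. \]

Lemma 3.2. The expectation of the term $V_k$ is

\[ \mathbf{E}_k^{\mu}[V_k] = \mathbf{E}_{(x,a) \sim \nu}
   \bigl[ r^*(x, a) + (1 - \varrho_k(x, a))\,\Delta(x, a) \bigr]. \]

Lemma 3.3. The variance of the term $V_k$ can be
decomposed and bounded as follows:

(i)
\[ \begin{aligned}
\mathbf{V}_k^{\mu}[V_k]
 ={} & \mathbf{V}_{x \sim D}\Bigl[ \mathbf{E}_{a \sim \nu(\cdot \mid x)}
       \bigl[ r^*(x, a) + (1 - \varrho_k(x, a))\,\Delta(x, a) \bigr] \Bigr] \\
     & - \mathbf{E}_{x \sim D}\Bigl[ \mathbf{E}_{a \sim \nu(\cdot \mid x)}
       \bigl[ \varrho_k(x, a)\,\Delta(x, a) \bigr]^2 \Bigr] \\
     & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
       \frac{\nu(a \mid x)}{\hat{\mu}_k(a \mid x)} \cdot \varrho_k(x, a)
       \cdot \mathbf{V}_{r \sim D(\cdot \mid x, a)}[r] \Bigr] \\
     & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
       \frac{\nu(a \mid x)}{\hat{\mu}_k(a \mid x)} \cdot \varrho_k(x, a)\,
       \Delta(x, a)^2 \Bigr].
\end{aligned} \]

(ii)
\[ \begin{aligned}
\mathbf{V}_k^{\mu}[V_k]
 \le{} & \mathbf{V}_{x \sim D}\bigl[ r^*(x, \nu) \bigr] \\
       & + 2\,\mathbf{E}_{(x,a) \sim \nu}\bigl[
         \bigl| (1 - \varrho_k(x, a))\,\Delta(x, a) \bigr| \bigr] \\
       & + M\,\mathbf{E}_{(x,a) \sim \nu}\Bigl[ \varrho_k(x, a)\,
         \mathbf{E}_{r \sim D(\cdot \mid x, a)}
         \bigl[ (r - \hat{r}(x, a))^2 \bigr] \Bigr].
\end{aligned} \]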


The range of $V_k$ is controlled by the worst-case ratio
$\nu(a_k \mid x_k) / \hat{\mu}_k(a_k \mid x_k)$. The bias of $V_k$ gets
smaller as $\hat{r}$ and $\hat{\mu}_k$ become more accurate, that is,
as $\Delta \approx 0$ and $\varrho_k \approx 1$. The expression for
variance is more complicated. Lemma 3.3(i) lists four terms. The first
term represents the variance component due to the randomness over $x$.
The second term can contribute to the decrease in the variance. The
final two terms represent the penalty due to the importance weighting.
The third term scales with the conditional variance of rewards (given
contexts and actions), and it vanishes if rewards are deterministic.
The fourth term scales with the magnitude of $\Delta$, and it captures
the potential improvement due to the use of a good estimator $\hat{r}$.

The upper bound on the variance [Lemma 3.3(ii)] is easier to interpret.
The first term is the variance of the estimated variable over $x$. The
second term measures the quality of the estimators $\hat{\mu}_k$ and
$\hat{r}$; it equals zero if either of them is perfect (or if the union
of regions where they are perfect covers the support of $\nu$ over $x$
and $a$). The final term represents the importance weighting penalty.
It vanishes if we do not apply importance weighting (i.e.,
$\hat{\mu}_k \equiv \infty$ and $\varrho_k \equiv 0$). With nonzero
$\varrho_k$, this term decreases with a better quality of $\hat{r}$,
but it does not disappear even if $\hat{r}$ is perfect (unless the
rewards are deterministic).

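3.4.1 Bias analysis. Lemma 3.2 immediately yields a bound on the bias
of the doubly robust estimator, as stated in the following theorem. The
special case for stationary policies (second part of the theorem) has
been shown by Vansteelandt, Bekaert and Claeskens (2012).

Theorem 3.4. Let $\Delta$ and $\varrho_k$ be defined as above. Then the
bias of the doubly robust estimator is

\[ \bigl| \mathbf{E}_{\mu}[\hat{V}_{\mathrm{DR}}] - V \bigr|
   = \frac{1}{n} \left| \mathbf{E}_{\mu}\left[ \sum_{k=1}^{n}
     \mathbf{E}_{(x,a) \sim \nu}\bigl[ (1 - \varrho_k(x, a))\,
     \Delta(x, a) \bigr] \right] \right|. \]

If the exploration policy $\mu$ and the estimator $\hat{\mu}_k$ are
stationary (i.e., $\mu_k = \mu_1$ and $\hat{\mu}_k = \hat{\mu}_1$ for
all $k$), the expression simplifies to

\[ \bigl| \mathbf{E}_{\mu}[\hat{V}_{\mathrm{DR}}] - V \bigr|
   = \bigl| \mathbf{E}_{\nu}\bigl[ (1 - \varrho_1(x, a))\,
     \Delta(x, a) \bigr] \bigr|. \]

Proof. The theorem follows immediately from Lemma 3.2. □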
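
The stationary bias expression can be verified by exact enumeration on a small finite problem. The sketch below is illustrative only: the two-context problem, the deliberately mis-specified propensity estimate of 0.4 (against a true value of 0.5), and the reward model are hypothetical. It compares the expectation of a single doubly robust term against the formula in terms of the additive reward-model error and the multiplicative propensity error, and also evaluates the analogous bias magnitudes for DM (infinite propensity estimate) and IPS (zero reward model).

```python
D = {0: 0.5, 1: 0.5}                                  # context distribution
ACTIONS = [0, 1]
nu = lambda a, x: 0.75 if a == x else 0.25            # target policy
mu = lambda a, x: 0.5                                 # true exploration policy
mu_hat = lambda a, x: 0.4                             # mis-specified estimate
r_star = lambda x, a: 0.9 if a == x else 0.1          # true expected reward
r_hat = lambda x, a: 0.7 if a == x else 0.3           # imperfect reward model

def expect_nu(f):
    """Expectation of f(x, a) with x ~ D and a ~ nu(.|x)."""
    return sum(px * nu(a, x) * f(x, a) for x, px in D.items() for a in ACTIONS)

def expected_dr_term():
    """Exact expectation of one DR term when actions are drawn from mu.
    The term is linear in r, so r can be replaced by r_star."""
    total = 0.0
    for x, px in D.items():
        baseline = sum(nu(b, x) * r_hat(x, b) for b in ACTIONS)
        for a in ACTIONS:
            corr = nu(a, x) / mu_hat(a, x) * (r_star(x, a) - r_hat(x, a))
            total += px * mu(a, x) * (baseline + corr)
    return total

rho = lambda x, a: mu(a, x) / mu_hat(a, x)            # multiplicative error
delta = lambda x, a: r_hat(x, a) - r_star(x, a)       # additive error

V = expect_nu(r_star)
direct_gap = abs(expected_dr_term() - V)
dr_bias = abs(expect_nu(lambda x, a: (1 - rho(x, a)) * delta(x, a)))
dm_bias = abs(expect_nu(delta))
ips_bias = abs(expect_nu(lambda x, a: r_star(x, a) * (1 - rho(x, a))))
print(direct_gap, dr_bias, dm_bias, ips_bias)
```

In this particular toy problem both estimates are imperfect, yet the doubly robust bias is the smallest of the three because it is the product of the two errors.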


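In contrast, we have (for simplicity, assuming stationarity of the
exploration policy and its estimate)

\[ \bigl| \mathbf{E}_{\mu}[\hat{V}_{\mathrm{DM}}] - V \bigr|
   = \bigl| \mathbf{E}_{\nu}[\Delta(x, a)] \bigr|, \]

\[ \bigl| \mathbf{E}_{\mu}[\hat{V}_{\mathrm{IPS}}] - V \bigr|
   = \bigl| \mathbf{E}_{\nu}\bigl[ r^*(x, a)\,(1 - \varrho_1(x, a))
     \bigr] \bigr|, \]

where the first equality is based on the observation that DM is a
special case of DR with $\hat{\mu}_k(a \mid x) \equiv \infty$ (and
hence $\varrho_k \equiv 0$), and the second equality is based on the
observation that IPS is a special case of DR with
$\hat{r}(x, a) \equiv 0$ (and hence $\Delta \equiv -r^*$).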

In general, neither of the estimators dominates the others. However, if
either $\Delta \approx 0$, or $\varrho_k \approx 1$, the expected value
of the doubly robust estimator will be close to the true value, whereas
DM requires $\Delta \approx 0$ and IPS requires $\varrho_k \approx 1$.
Also, if $\|\varrho_k - 1\|_{p,\nu} \ll 1$ [for a suitable $L_p(\nu)$
norm], we expect that DR will outperform DM. Similarly, if
$\varrho_k \approx 1$ but $\|\Delta\|_{p,\nu} \ll \|r^*\|_{p,\nu}$, we
expect that DR will outperform IPS. Thus, DR can effectively take
advantage of both sources of information to lower the bias.

3.4.2 Variance analysis. We argued that the expected value of
$\hat{V}_{\mathrm{DR}}$ compares favorably with IPS and DM. We next
look at the variance of DR. Since large-deviation bounds have a primary
dependence on variance, a lower variance implies a faster convergence
rate. To contrast DR with IPS and DM, we study a simpler setting with a
stationary exploration policy, and deterministic target policy $\nu$,
that is, $\nu(\cdot \mid x)$ puts all the probability on a single
action. In the next section, we revisit the fully general setting and
derive a finite-sample bound on the error of DR.


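Theorem 3.5. Let $\Delta$ and $\varrho_k$ be defined as above. If
exploration policy $\mu$ and the estimator $\hat{\mu}_k$ are
stationary, and the target policy $\nu$ is deterministic, then the
variance of the doubly robust estimator is

\[ \begin{aligned}
\mathbf{V}_{\mu}[\hat{V}_{\mathrm{DR}}] = \frac{1}{n} \Biggl(
   & \mathbf{V}_{(x,a) \sim \nu}\bigl[ r^*(x, a)
     + (1 - \varrho_1(x, a))\,\Delta(x, a) \bigr] \\
   & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
     \frac{1}{\hat{\mu}_1(a \mid x)} \cdot \varrho_1(x, a) \cdot
     \mathbf{V}_{r \sim D(\cdot \mid x, a)}[r] \Bigr] \\
   & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
     \frac{1 - \mu_1(a \mid x)}{\hat{\mu}_1(a \mid x)} \cdot
     \varrho_1(x, a)\,\Delta(x, a)^2 \Bigr] \Biggr).
\end{aligned} \]

Proof. The theorem follows immediately from Lemma 3.3(i). □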


The variance can be decomposed into three terms. The first term
accounts for the randomness in $x$ (note that $a$ is deterministic
given $x$). The other two terms can be viewed as the importance
weighting penalty. These two terms disappear in DM, which does not use
rewards $r_k$. The second term accounts for randomness in rewards and
disappears when rewards are deterministic functions of $x$ and $a$.
However, the last term stays, accounting for the disagreement between
actions taken by $\nu$ and $\mu_1$.

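Similar expressions can be derived for the DM and IPS estimators. Since
IPS is a special case of DR with $\hat{r} \equiv 0$, we obtain the
following equation:

\[ \begin{aligned}
\mathbf{V}_{\mu}[\hat{V}_{\mathrm{IPS}}] = \frac{1}{n} \Biggl(
   & \mathbf{V}_{(x,a) \sim \nu}\bigl[ \varrho_1(x, a)\,r^*(x, a) \bigr] \\
   & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
     \frac{1}{\hat{\mu}_1(a \mid x)} \cdot \varrho_1(x, a) \cdot
     \mathbf{V}_{r \sim D(\cdot \mid x, a)}[r] \Bigr] \\
   & + \mathbf{E}_{(x,a) \sim \nu}\Bigl[
     \frac{1 - \mu_1(a \mid x)}{\hat{\mu}_1(a \mid x)} \cdot
     \varrho_1(x, a)\,r^*(x, a)^2 \Bigr] \Biggr).
\end{aligned} \]

The first term will be of similar magnitude as the corresponding term
of the DR estimator, provided that $\varrho_1 \approx 1$. The second
term is identical to the DR estimator. However, the third term can be
much larger for IPS if $\mu_1(a \mid x) \ll 1$ and $|\Delta(x, a)|$ is
smaller than $r^*(x, a)$ for the actions chosen by $\nu$.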

In contrast, for the direct method, which is a special case of DR with
$\hat{\mu}_k \equiv \infty$, the following variance is obtained
immediately:

\[ \mathbf{V}_{\mu}[\hat{V}_{\mathrm{DM}}] = \frac{1}{n}\,
   \mathbf{V}_{(x,a) \sim \nu}\bigl[ r^*(x, a) + \Delta(x, a) \bigr]. \]

Thus, the variance of the direct method does not have terms depending
either on the exploration policy or the randomness in the rewards. This
fact usually suffices to ensure that its variance is significantly
lower than that of DR or IPS. However, as mentioned in the previous
section, when we can estimate $\mu_k$ reasonably well (namely,
$\varrho_k \approx 1$), the bias of the direct method is typically much
larger, leading to larger errors in estimating policy values.
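
The variance comparison can be made concrete by exact enumeration. The sketch below is a hypothetical toy problem (two equally likely contexts, uniform exploration, a deterministic target policy, Bernoulli rewards); none of the numbers come from the paper. It computes the exact variance of a single term by enumerating contexts, actions and rewards, and checks it against the three-term decomposition for one sample, once with an imperfect reward model (DR) and once with a zero reward model (IPS).

```python
D = {0: 0.5, 1: 0.5}                            # context distribution
ACTIONS = [0, 1]
target = lambda x: x                            # deterministic target policy
mu = lambda a, x: 0.5                           # true exploration policy
mu_hat = lambda a, x: 0.5                       # accurate propensity estimate
r_star = lambda x, a: 0.9 if a == x else 0.1    # P(r = 1 | x, a)

def term_variance(r_hat):
    """Exact variance of one DR term, enumerating (x, a, r), r in {0, 1}."""
    m1 = m2 = 0.0
    for x, px in D.items():
        base = r_hat(x, target(x))
        for a in ACTIONS:
            w = (1.0 if a == target(x) else 0.0) / mu_hat(a, x)
            for r, pr in ((1.0, r_star(x, a)), (0.0, 1.0 - r_star(x, a))):
                v = base + w * (r - r_hat(x, a))
                p = px * mu(a, x) * pr
                m1 += p * v
                m2 += p * v * v
    return m2 - m1 * m1

def decomposition(r_hat):
    """Three-term variance decomposition with n = 1."""
    f1 = f2 = second = third = 0.0
    for x, px in D.items():
        a = target(x)
        rho = mu(a, x) / mu_hat(a, x)
        delta = r_hat(x, a) - r_star(x, a)
        f = r_star(x, a) + (1.0 - rho) * delta
        f1 += px * f
        f2 += px * f * f
        bern_var = r_star(x, a) * (1.0 - r_star(x, a))
        second += px * rho / mu_hat(a, x) * bern_var
        third += px * (1.0 - mu(a, x)) / mu_hat(a, x) * rho * delta ** 2
    return (f2 - f1 * f1) + second + third

model = lambda x, a: 0.7 if a == x else 0.3     # imperfect reward model
zero = lambda x, a: 0.0                         # IPS as a special case
var_dr, var_ips = term_variance(model), term_variance(zero)
print(round(var_dr, 6), round(var_ips, 6))
```

In this toy problem the reward-noise term is the same for both estimators, and the difference comes entirely from the last term, where the squared model error replaces the squared expected reward.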


3.4.3 Finite-sample error bound. By combining bias and variance bounds,
we now work out a specific finite-sample bound on the error of the
estimator $\hat{V}_{\mathrm{DR}}$. While such an error bound could be
used as a conservative confidence bound, we expect it to be too loose
in most settings (as is typical for finite-sample bounds). Instead, our
main intention is to explicitly highlight how the errors of estimators
$\hat{r}$ and $\hat{\mu}_k$ contribute to the final error.


To begin, we first quantify magnitudes of the additive error
$\Delta = \hat{r} - r^*$ of the estimator $\hat{r}$, and the relative
error $|1 - \varrho_k| = |\hat{\mu}_k - \mu_k| / \hat{\mu}_k$ of the
estimator $\hat{\mu}_k$:

Assumption 3.6. Assume there exist $\delta_\Delta, \delta_\varrho \ge 0$
such that

\[ \mathbf{E}_{(x,a) \sim \nu}\bigl[ |\Delta(x, a)| \bigr]
   \le \delta_\Delta, \]

and with probability one under $\mu$:

\[ |1 - \varrho_k(x, a)| \le \delta_\varrho \quad \text{for all } k. \]

Recall that $\nu / \hat{\mu}_k \le M$. In addition, our analysis
depends on the magnitude of the ratio
$\varrho_k = \mu_k / \hat{\mu}_k$ and a term that captures both the
variance of the rewards and the error of $\hat{r}$.


Assumption 3.7. Assume there exist
$\varepsilon_{\hat{r}}, \varrho_{\max} \ge 0$ such that with
probability one under $\mu$, for all $k$:

\[ \mathbf{E}_{(x,a) \sim \nu}\Bigl[
   \mathbf{E}_{r \sim D(\cdot \mid x, a)}\bigl[
   (\hat{r}(x, a) - r)^2 \bigr] \Bigr] \le \varepsilon_{\hat{r}}, \]

\[ \varrho_k(x, a) \le \varrho_{\max} \quad \text{for all } x, a. \]

With the assumptions above, we can now bound the bias and variance of a
single term $V_k$. As in the previous sections, the bias decreases with
the quality of $\hat{r}$ and $\hat{\mu}_k$, and the variance increases
with the variance of the rewards and with the magnitudes of the ratios
$\nu / \hat{\mu}_k \le M$, $\mu_k / \hat{\mu}_k \le \varrho_{\max}$.
The analysis below for instance captures the bias-variance tradeoff of
using $\hat{\mu}_k \approx c\mu_k$ for some $c > 1$: such a strategy
can lead to a lower variance (by lowering $M$ and $\varrho_{\max}$) but
incurs some additional bias that is controlled by the quality of
$\hat{r}$.


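Lemma 3.8. Under Assumptions 3.6-3.7, with probability one under $\mu$,
for all $k$:

\[ \bigl| \mathbf{E}_k^{\mu}[V_k] - V \bigr|
   \le \delta_\varrho\,\delta_\Delta, \]

\[ \mathbf{V}_k^{\mu}[V_k] \le \mathbf{V}_{x \sim D}[r^*(x, \nu)]
   + 2\,\delta_\varrho\,\delta_\Delta
   + M\,\varrho_{\max}\,\varepsilon_{\hat{r}}. \]

Proof. The bias and variance bound follow from Lemma 3.2 and
Lemma 3.3(ii), respectively, by Hölder’s inequality. □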


Using the above lemma and Freedman’s inequality 
yields the following theorem. 

Theorem 3.9. Under Assumptions 3.6-3.7, with probability at least $1 - \delta$,

$$|\hat V_{\mathrm{DR}} - V| \le \delta_\varrho\delta_\Delta + 2\max\left\{\frac{(1+M)\ln(2/\delta)}{n},\ \sqrt{\frac{\bigl(V_{x\sim D}[r^*(x,\nu)] + 2\delta_\varrho\delta_\Delta + M\varrho_{\max}e_{\hat r}\bigr)\ln(2/\delta)}{n}}\right\}.$$

Proof. The proof follows by Freedman's inequality (Theorem B.1 in Appendix B), applied to random variables $\hat V_k$, whose range and variance are bounded using Lemmas 3.1 and 3.8. □
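To get a feel for how the terms of Theorem 3.9 interact, the following minimal sketch evaluates the right-hand side numerically, reading the bound as $\delta_\varrho\delta_\Delta + 2\max\{(1+M)\ln(2/\delta)/n,\ \sqrt{\mathrm{variance\ bound}\cdot\ln(2/\delta)/n}\}$. The function name, argument names and the example values are illustrative assumptions, not part of the original analysis.

```python
import math


def dr_error_bound(n, delta, M, var_rstar, delta_rho, delta_Delta, rho_max, e_rhat):
    """Evaluate the finite-sample bound of Theorem 3.9 (hypothetical helper).

    var_rstar stands for V_{x~D}[r*(x, nu)]; all other arguments follow the
    constants named in Assumptions 3.6-3.7.
    """
    bias = delta_rho * delta_Delta                      # bias term from Lemma 3.8
    log_term = math.log(2.0 / delta)
    range_term = (1.0 + M) * log_term / n               # contribution of the range of the terms
    variance = var_rstar + 2.0 * bias + M * rho_max * e_rhat
    variance_term = math.sqrt(variance * log_term / n)  # contribution of the variance
    return bias + 2.0 * max(range_term, variance_term)


if __name__ == "__main__":
    for n in (10**3, 10**5, 10**7):
        print(n, dr_error_bound(n, 0.05, 10.0, 0.25, 0.1, 0.2, 1.0, 0.1))
```

As the sample size grows, only the product of the two error magnitudes survives, which is why an accurate $\hat r$ or an accurate $\hat\mu_k$ is enough to drive the bound toward zero.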

The theorem is a finite-sample error bound that holds for all sample sizes $n$, and in the limit the error converges to $\delta_\varrho\delta_\Delta$. As we mentioned, this result gives a confidence interval for the doubly-robust estimate $\hat V_{\mathrm{DR}}$ for any finite sample $n$. Other authors have used asymptotic theory to derive confidence intervals for policy evaluation by showing that the estimator is asymptotically normal (e.g., Murphy, van der Laan and Robins, 2001; Zhang et al., 2012). When using asymptotic confidence bounds, it can be difficult to know a priori whether the asymptotic distribution has been reached, whereas our bound applies to all finite sample sizes. Although our bound may be conservative for small sample sizes, it provides a “safe” nonasymptotic confidence interval. In certain applications like those on the Internet, the sample size is usually large enough for this kind of nonasymptotic confidence bound to be almost as small as its asymptotic value (the term $\delta_\varrho\delta_\Delta$ in Theorem 3.9), as demonstrated by Bottou et al. (2013) for online advertising.

Note that Assumptions 3.6-3.7 rely on bounds of $|1 - \varrho_k|$ and $\varrho_k$ which have to hold with probability one. In Appendix C, we replace these bounds with moment bounds, and present analogs of Lemma 3.8 and Theorem 3.9.

4. EXPERIMENTS: THE STATIONARY CASE 

This section provides empirical evidence for the effectiveness of the DR estimator compared to IPS and DM. We study these estimators on several real-world datasets. First, we use public benchmark datasets for multiclass classification to construct contextual bandit data, on which we evaluate both policy evaluation and policy optimization approaches. Second, we use a proprietary dataset to model the pattern of user visits to an Internet portal. We study covariate shift, which can be formalized as a special case of policy evaluation. Our third experiment uses another proprietary dataset to model slotting of various types of search results on a webpage.


4.1 Multiclass Classification with Partial 
Feedback 

We begin with a description of how to turn a $K$-class classification dataset into a $K$-armed contextual bandit dataset. Instead of rewards, we will work with losses, specifically the 0/1-classification error. The actions correspond to predicted classes. In the usual multiclass classification, we can infer the loss of any action on training data (since we know its correct label), so we call this a full feedback setting. On the other hand, in contextual bandits, we only know the loss of the specific action that was taken by the exploration policy, but of no other action, which we call a partial feedback setting. After choosing an exploration policy, our transformation from full to partial feedback simply “hides” the losses of actions that were not picked by the exploration policy.

This protocol gives us two benefits. First, we can carry out comparisons using public multiclass classification datasets, which are more common than contextual bandit datasets. Second, fully revealed data can be used to obtain the ground truth value of an arbitrary policy. Note that the original data is real-world, but exploration and partial feedback are simulated.

4.1.1 Data generation In a classification task, we assume data are drawn IID from a fixed distribution: $(x,y) \sim D$, where $x \in \mathcal{X}$ is a real-valued covariate vector and $y \in \{1,2,\dots,K\}$ is a class label. A typical goal is to find a classifier $\nu: \mathcal{X} \to \{1,2,\dots,K\}$ minimizing the classification error:

$$e(\nu) = E_{(x,y)\sim D}\bigl[I[\nu(x) \ne y]\bigr],$$

where $I[\cdot]$ is an indicator function, equal to 1 if its argument is true and 0 otherwise.

The classifier $\nu$ can be viewed as a deterministic stationary policy with the action set $\mathcal{A} = \{1,\dots,K\}$ and the loss function

$$l(y,a) = I[a \ne y].$$

Loss minimization is symmetric to reward maximization (under the transformation $r = 1 - l$), but loss minimization is more commonly used in the classification setting, so we work with loss here. Note that the distribution $D(y|x)$, together with the definition of the loss above, induces the conditional probability $D(l|x,a)$ in contextual bandits, and minimizing the classification error coincides with policy optimization.

Table 1
Characteristics of benchmark datasets used in Section 4.1

Dataset        Classes (K)    Sample size
Ecoli          8              336
Glass          6              214
Letter         26             20,000
Optdigits      10             5620
Page-blocks    5              5473
Pendigits      10             10,992
Satimage       6              6435
Vehicle        4              846
Yeast          10             1484

To construct partially labeled data in multiclass classification, it remains to specify the exploration policy. We simulate stationary exploration with $\mu_k(a|x) = \mu_1(a|x) = 1/K$ for all $a$. Hence, the original example $(x,y)$ is transformed into an example $(x, a, l(y,a))$ for a randomly selected action $a \sim \mathrm{uniform}(1,2,\dots,K)$. We assume perfect logging of the exploration policy and use the estimator $\hat\mu_k = \mu_k$. Below, we describe how we obtained an estimator $\hat l(x,a)$ (the counterpart of $\hat r$).
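The full-to-partial feedback transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the tuple layout `(x, a, loss, p)` and the use of Python's `random.Random` are choices made here, not details from the experiments.

```python
import random


def to_partial_feedback(examples, K, rng):
    """Turn full-feedback pairs (x, y) into bandit tuples (x, a, loss, p).

    Exploration is uniform over the K classes, so the logged probability
    of every chosen action is 1/K (perfect logging).
    """
    bandit_data = []
    for x, y in examples:
        a = rng.randint(1, K)        # uniform action from {1, ..., K}
        loss = 0 if a == y else 1    # 0/1 loss of the chosen action only
        bandit_data.append((x, a, loss, 1.0 / K))
    return bandit_data


if __name__ == "__main__":
    full = [((0.2, 1.5), 2), ((1.1, 0.3), 1), ((0.7, 0.7), 3)]
    print(to_partial_feedback(full, K=3, rng=random.Random(0)))
```

The losses of the $K-1$ actions that were not drawn never enter the output, which is exactly the information a contextual bandit learner is denied.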

Table 1 summarizes the benchmark problems 
adopted from the UCI repository (Asuncion and 
Newman, 2007). 

4.1.2 Policy evaluation We first investigate whether the DR technique indeed gives more accurate estimates of the policy value (or classification error in our context), compared to DM and IPS. For each dataset:

1. We randomly split data into training and evaluation sets of (roughly) the same size;

2. On the training set, we keep full classification feedback of form $(x,y)$ and train the direct loss minimization (DLM) algorithm of McAllester, Hazan and Keshet (2011), based on gradient descent, to obtain a classifier (see Appendix D for details). This classifier constitutes the policy $\nu$ whose value we estimate on evaluation data;

3. We compute the classification error on fully observed evaluation data. This error is treated as the ground truth for comparing various estimates;

4. Finally, we apply the transformation in Section 4.1.1 to the evaluation data to obtain a partially labeled set (exploration history), from which DM, IPS and DR estimates are computed.

Both DM and DR require estimating the expected conditional loss for a given $(x,a)$. We use a linear loss model: $\hat l(x,a) = w_a \cdot x$, parameterized by $K$ weight vectors $\{w_a\}_{a\in\{1,\dots,K\}}$, and use least-squares ridge regression to fit $w_a$ based on the training set.

Step 4 of the above protocol is repeated 500 times, 
and the resulting bias and rmse (root mean squared 
error) are reported in Figure 1. 
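For a deterministic policy, the three estimates computed in Step 4 can be sketched as follows. This is a minimal illustration, assuming exploration data stored as tuples `(x, a, loss, p)` with `p` the logged probability of the chosen action; the function names and the callables `policy` and `loss_model` are hypothetical.

```python
def dm_estimate(data, policy, loss_model):
    """Direct method: average the modeled loss of the policy's action."""
    return sum(loss_model(x, policy(x)) for x, _, _, _ in data) / len(data)


def ips_estimate(data, policy):
    """Inverse propensity score: reweight losses where the logged action matches."""
    total = 0.0
    for x, a, loss, p in data:
        if a == policy(x):
            total += loss / p
    return total / len(data)


def dr_estimate(data, policy, loss_model):
    """Doubly robust: modeled loss plus an IPS correction of the model's residual."""
    total = 0.0
    for x, a, loss, p in data:
        target = policy(x)
        estimate = loss_model(x, target)
        if a == target:
            estimate += (loss - loss_model(x, a)) / p
        total += estimate
    return total / len(data)
```

Two sanity checks follow directly from the definitions: with a loss model that is identically zero, DR reduces to IPS, and with a perfect loss model the correction term vanishes so DR coincides with DM.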

As predicted by analysis, both IPS and DR are unbiased, since the estimator $\hat\mu_k$ is perfect. In contrast, the linear loss model fails to capture the classification error accurately, and as a result, DM suffers a much larger bias.

While IPS and DR estimators are unbiased, it is apparent from the rmse plot that the DR estimator enjoys a lower variance, which translates into a smaller rmse. As we shall see next, this has a substantial effect on the quality of policy optimization.


Fig. 1. Comparison of bias (left) and rmse (right) of the three estimators of classification error on partial feedback classification data.

4.1.3 Policy optimization This subsection deviates from much of the paper to study policy optimization rather than policy evaluation. Given a space of possible policies, policy optimization is a procedure that searches this space for the policy with the highest value. Since policy values are unknown, the optimization procedure requires access to exploration data and uses a policy evaluator as a subroutine. Given the superiority of DR over DM and IPS for policy evaluation (in the previous subsection), a natural question is whether a similar benefit can be translated into policy optimization as well. Since DM is significantly worse on all datasets, as indicated in Figure 1, we focus on the comparison between IPS and DR.

Here, we apply the data transformation in Section 4.1.1 to the training data, and then learn a classifier based on the loss estimated by IPS and DR, respectively. Specifically, for each dataset, we repeat the following steps 30 times:

1. We randomly split data into training (70%) and 
test (30%) sets; 

2. We apply the transformation in Section 4.1.1 to 
the training data to obtain a partially labeled set 
(exploration history); 

3. We then use the IPS and DR estimators to impute unrevealed losses in the training data; that is, we transform each partial-feedback example $(x, a, l)$ into a cost-sensitive example of the form $(x, l_1, \dots, l_K)$ where $l_{a'}$ is the loss for action $a'$, imputed from the partial feedback data as follows:

$$l_{a'} = \begin{cases} \hat l(x,a') + \dfrac{l - \hat l(x,a')}{\hat\mu_1(a'|x)}, & \text{if } a' = a, \\[1ex] \hat l(x,a'), & \text{if } a' \ne a. \end{cases}$$

In both cases, $\hat\mu_1(a'|x) = 1/K$ (recall that $\hat\mu_1 = \hat\mu_k$); in DR we use the loss estimate $\hat l$ (described below), in IPS we use $\hat l(x,a') = 0$;

4. Two cost-sensitive multiclass classification algorithms are used to learn a classifier from the losses completed by either IPS or DR: the first is DLM used also in the previous section (see Appendix D and McAllester, Hazan and Keshet, 2011), the other is the Filter Tree reduction of Beygelzimer, Langford and Ravikumar (2008) applied to a decision-tree base learner (see Appendix E for more details);

5. Finally, we evaluate the learned classifiers on the 
test data to obtain classification error. 
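The imputation in Step 3 can be sketched as below. This is an illustrative helper, not the experimental code: the name `impute_losses`, its signature and the default of uniform exploration are assumptions. Passing no loss model corresponds to IPS (the estimate $\hat l$ is taken to be zero); passing one gives the DR imputation.

```python
def impute_losses(x, a, loss, K, loss_model=None, mu=None):
    """Return the imputed loss vector [l_1, ..., l_K] for one bandit example.

    x, a, loss : context, logged action in {1, ..., K}, and its observed loss
    loss_model : callable (x, a') -> estimated loss; None means IPS
    mu         : logged probability of the chosen action (defaults to 1/K)
    """
    if mu is None:
        mu = 1.0 / K
    model = loss_model if loss_model is not None else (lambda _x, _a: 0.0)
    imputed = []
    for a_prime in range(1, K + 1):
        estimate = model(x, a_prime)
        if a_prime == a:
            # Correct the model on the one action whose loss was observed.
            estimate += (loss - model(x, a_prime)) / mu
        imputed.append(estimate)
    return imputed


if __name__ == "__main__":
    print(impute_losses(x=None, a=2, loss=1, K=4))                                 # IPS
    print(impute_losses(x=None, a=2, loss=1, K=4, loss_model=lambda x, a: 0.5))   # DR
```

With IPS, every unobserved action receives loss zero and the observed one is scaled by $K$; with DR, the unobserved actions fall back on the model, which is the source of the variance reduction.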


Again, we use least-squares ridge regression to build a linear loss estimator: $\hat l(x,a) = w_a \cdot x$. However, since the training data is partially labeled, $w_a$ is fitted only using training data $(x, a', l)$ for which $a = a'$. Note that this choice slightly violates our assumptions, because $\hat l$ is not independent of the training data $z_n$. However, we expect the dependence to be rather weak, and we find this approach to be more realistic in practical scenarios where one might want to use all available data to form the reward estimator, for instance due to data scarcity.

Average classification errors (obtained in Step 5 above) of 30 runs are plotted in Figure 2. Clearly, for policy optimization, the advantage of the DR is even greater than for policy evaluation. In all datasets, DR provides substantially more reliable loss estimates than IPS, and results in significantly improved classifiers.

[Figure 2 legend: left panel shows IPS (DLM), DR (DLM) and Offset Tree; right panel shows IPS (Filter Tree), DR (Filter Tree) and Offset Tree.]

Fig. 2. Classification error of direct loss minimization (left) and filter tree (right). Note that the representations used by DLM and the trees are very different, making any comparison between the two approaches difficult. However, the Offset Tree and Filter Tree approaches share a similar tree representation of the classifiers, so differences in performance are purely a matter of superior optimization.

Figure 2 also includes classification error of the Offset Tree reduction (Beygelzimer and Langford, 2009), which is designed specifically for policy optimization with partially labeled data.$^3$ While the IPS versions of DLM and Filter Tree are rather weak, the DR versions are competitive with Offset Tree in all datasets, and in some cases significantly outperform Offset Tree.

Our experiments show that DR provides similar improvements in two very different algorithms, one based on gradient descent, the other based on tree induction, suggesting the DR technique is generally useful when combined with different algorithmic choices.

4.2 Estimating the Average Number of User 
Visits 

The next problem we consider is estimating the average number of user visits to a popular Internet portal. We formulate this as a regression problem and in our evaluation introduce an artificial covariate shift. As in the previous section, the original data is real-world, but the covariate shift is simulated.

Real user visits to the website were recorded for about 4 million bcookies$^4$ randomly selected from all bcookies during March 2010. Each bcookie is associated with a sparse binary covariate vector in 5000 dimensions. These covariates describe browsing behavior as well as other information (such as age, gender and geographical location) of the bcookie. We chose a fixed time window in March 2010 and calculated the number of visits by each selected bcookie during this window. To summarize, the dataset contains $N = 3{,}854{,}689$ data points: $D = \{(b_i, x_i, v_i)\}_{i=1,\dots,N}$, where $b_i$ is the $i$th (unique) bcookie, $x_i$ is the corresponding binary covariate vector, and $v_i$ is the number of visits (the response variable); we treat the empirical distribution over $D$ as the ground truth.

$^3$ We used decision trees as the base learner in Offset Trees to parallel our base learner choice in Filter Trees. The numbers reported here are not identical to those by Beygelzimer and Langford (2009), even though we used a similar protocol on the same datasets, probably because of small differences in the data structures used.

$^4$ A bcookie is a unique string that identifies a user. Strictly speaking, one user may correspond to multiple bcookies, but for simplicity we equate a bcookie with a user.

If it is possible to sample $x$ uniformly at random from $D$ and measure the corresponding value $v$, the sample mean of $v$ will be an unbiased estimate of the true average number of user visits, which is 23.8 in this problem. However, in various situations, it may be difficult or impossible to ensure a uniform sampling scheme due to practical constraints. Instead, the best that one can do is to sample $x$ from some other distribution (e.g., allowed by the business constraints) and measure the corresponding value $v$. In other words, the sampling distribution of $x$ is changed, but the conditional distribution of $v$ given $x$ remains the same. In this case, the sample average of $v$ may be a biased estimate of the true quantity of interest. This setting is known as covariate shift (Shimodaira, 2000), where data are missing at random (see Kang and Schafer, 2007, for related comparisons).

Covariate shift can be modeled as a contextual bandit problem with 2 actions: action $a = 0$ corresponding to “conceal the response” and action $a = 1$ corresponding to “reveal the response.” Below we specify the stationary exploration policy $\mu_k(a|x) = \mu_1(a|x)$. The contextual bandit data is generated by first sampling $(x,v) \sim D$, then choosing an action $a \sim \mu_1(\cdot|x)$, and observing the reward $r = a \cdot v$ (i.e., reward is only revealed if $a = 1$). The exploration policy $\mu_1$ determines the covariate shift. The quantity of interest, $E_D[v]$, corresponds to the value of the constant policy $\nu$ which always chooses “reveal the response.”

To define the exploration sampling probabilities $\mu_1(a = 1|x)$, we adopted an approach similar to Gretton et al. (2008), with a bias toward the smaller values along the first principal component of the distribution over $x$. In particular, we obtained the first principal component (denoted $\bar x$) of all covariate vectors $\{x_i\}_{i=1,\dots,N}$, and projected all data onto $\bar x$. Let $\phi$ be the density of a univariate normal distribution with mean $m + (\bar m - m)/3$ and standard deviation $(\bar m - m)/4$, where $m$ is the minimum and $\bar m$ is the mean of the projected values. We set $\mu_1(a = 1|x) = \min\{\phi(x \cdot \bar x), 1\}$.

To control the size of exploration data, we randomly subsampled a fraction $f \in \{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05\}$ from the entire dataset $D$ and then chose actions $a$ according to the exploration policy. We then calculated the IPS and DR estimates on this subsample, assuming perfect logging, that is, $\hat\mu_k = \mu_k$.$^5$ The whole process was repeated 100 times.

Fig. 3. Comparison of IPS and DR: rmse (left), bias (right). The ground truth policy value (average number of user visits) is 23.8.

The DR estimator required building a reward model $\hat r(x,a)$, which, for a given covariate vector $x$ and $a = 1$, predicted the average number of visits (and for $a = 0$ was equal to zero). Again, least-squares ridge regression was used on a separate dataset to fit a linear model $\hat r(x,1) = w \cdot x$ from the exploration data.

Figure 3 summarizes the estimation error of the two methods with increasing exploration data size. For both IPS and DR, the estimation error goes down with more data. In terms of rmse, the DR estimator is consistently better than IPS, especially when dataset size is smaller. The DR estimator often reduces the rmse by a fraction between 10% and 20%, and on average by 13.6%. By comparing to the bias values (which are much smaller), it is clear that DR’s gain of accuracy comes from a lower variance, which accelerates convergence of the estimator to the true value. These results confirm our analysis that DR tends to reduce variance provided that a reasonable reward estimator is available.

4.3 Content Slotting in Response to User 
Queries 

In this section, we compare our estimators on a proprietary real-world dataset consisting of web search queries. In response to a search query, the search engine returns a set of search results. A search result can be of various types such as a web-link, a news snippet or a movie information snippet. We will be evaluating policies that decide which among the different result types to present at the first position. The reward is meant to capture the relevance for the user. It equals +1 if the user clicks on the result at the first position, −1 if the user clicks on some result below the first position, and 0 otherwise (for instance, if the user leaves the search page, or decides to rewrite the query). We call this a click-skip reward.

$^5$ Assuming perfect knowledge of exploration probabilities is fair when we compare IPS and DR. However, it does not give implications of how DR compares against DM when there is an estimation error in $\hat\mu_k$.

Our partially labeled dataset consists of tuples of the form $(x_k, a_k, r_k, p_k)$, where $x_k$ is the covariate vector (a sparse, high-dimensional representation of the terms of the query as well as other contextual information, such as user information), $a_k \in \{\text{web-link}, \text{news}, \text{movie}\}$ is the type of result at the first position, $r_k$ is the click-skip reward, and $p_k$ is the recorded probability with which the exploration policy chose the given result type. Note that due to practical constraints, the values $p_k$ do not always exactly correspond to $\mu_k(a_k|x_k)$ and should be really viewed as the “best effort” approximation of perfect logging. We still expect them to be highly accurate, so we use the estimator $\hat\mu_k(a_k|x_k) = p_k$.

The page views corresponding to these tuples represent a small percentage of user traffic to a major website; any visit to the website had a small chance of being part of this experiment. Data was collected over a span of several days during July 2011. It consists of 1.2 million tuples, out of which the first 1 million were used for estimating $\hat r$ (training data) with the remainder used for policy evaluation (evaluation data). The evaluation data was further split into 10 independent subsets of equal size, which were used to estimate variance of the compared estimators.

We estimated the value of two policies: the exploration policy itself, and the argmax policy (described below). Evaluating the exploration policy on its own exploration data (we call this setup self-evaluation) serves as a sanity check. The argmax policy is based on a linear estimator $r'(x,a) = w_a \cdot x$ (in general different from $\hat r$), and chooses the action with the largest predicted reward $r'(x,a)$ (hence the name). We fitted $r'(x,a)$ on training data by importance-weighted linear regression with importance weights $1/p_k$. Note that both $\hat r$ and $r'$ are linear estimators obtained from the same training set, but $\hat r$ was computed without importance weights and we therefore expect it to be more biased.

Table 2
The results of different policy evaluators on two standard policies for a real-world exploration problem. In the first column, results are normalized by the (known) actual reward of the deployed policy. In the second column, results are normalized by the reward reported by IPS. All ± are computed as standard deviations over results on 10 disjoint test sets. In previous publication of the same experiments (Dudik et al., 2012), we used a deterministic-policy version of DR (the same as in Dudik, Langford and Li, 2011), hence the results for self-evaluation presented there slightly differ.

        Self-evaluation    Argmax
IPS     0.995 ± 0.041      1.000 ± 0.027
DM      1.213 ± 0.010      1.211 ± 0.002
DR      0.974 ± 0.039      0.991 ± 0.026

Table 2 contains the comparison of IPS, DM and DR, for both policies under consideration. For business reasons, we do not report the estimated reward directly, but normalize to either the empirical average reward (for self-evaluation) or the IPS estimate (for the argmax policy evaluation).

The experimental results are generally in line with theory. The variance is smallest for DR, although IPS does surprisingly well on this dataset, presumably because $\hat r$ is not sufficiently accurate. The Direct Method (DM) has an unsurprisingly large bias. If we divide the listed standard deviations by $\sqrt{10}$, we obtain standard errors, suggesting that DR has a slight bias (on self-evaluation where we know the ground truth). We believe that this is due to imperfect logging.
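The standard-error argument can be checked with a short calculation on the self-evaluation column of Table 2, where the ground truth is normalized to 1. The helper name below is a hypothetical choice made for illustration.

```python
import math


def standardized_deviation(mean, std, n_subsets=10, truth=1.0):
    """Deviation of an estimate from the ground truth, in standard errors."""
    standard_error = std / math.sqrt(n_subsets)
    return (mean - truth) / standard_error


if __name__ == "__main__":
    # Self-evaluation column of Table 2: (estimator, mean, standard deviation).
    for name, mean, std in [("IPS", 0.995, 0.041), ("DM", 1.213, 0.010), ("DR", 0.974, 0.039)]:
        print(name, round(standardized_deviation(mean, std), 2))
```

On these numbers, IPS lies well within one standard error of the ground truth, DR lies roughly two standard errors below it, and DM is off by dozens of standard errors, which matches the qualitative reading in the text.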

5. EVALUATION OF NONSTATIONARY 
POLICIES 

5.1 Problem Definition 

The contextual bandit setting can also be used to model a broad class of sequential decision-making problems, where the decision maker adapts her action-selection policy over time, based on her observed history of context-action-reward triples. In contrast to policies studied in the previous two sections, such a policy depends on both the current context and the current history and is therefore nonstationary.

In the personalized news recommendation example (Li et al., 2010), a learning algorithm chooses an article (an action) for the current user (the context), with the need for balancing exploration and exploitation. Exploration corresponds to presenting articles about which the algorithm does not yet have enough data to conclude if they are of interest to a particular type of user. Exploitation corresponds to presenting articles for which the algorithm collected enough data to know that they elicit a positive response. At the beginning, the algorithm may pursue more aggressive exploration since it has a more limited knowledge of what the users like. As more and more data is collected, the algorithm eventually converges to a good recommendation policy and performs more exploitation. Obviously, for the same user, the algorithm may choose different articles in different stages, so the policy is not stationary. In machine learning terminology, such adaptive procedures are called online learning algorithms. Evaluating performance of an online learning algorithm (in terms of average per-step reward when run for $T$ steps) is an important problem in practice. Online learning algorithms are specific instances of nonstationary policies.

Formally, a nonstationary randomized policy is described by a conditional distribution $\pi(a_t|x_t, h_{t-1})$ of choosing an action $a_t$ on a context $x_t$, given the history of past observations

$$h_{t-1} = (x_1,a_1,r_1),\dots,(x_{t-1},a_{t-1},r_{t-1}).$$

We use the index $t$ (instead of $k$), and write $h_t$ (instead of $z_k$) to make clear the distinction between the histories experienced by the target policy $\pi$ versus the exploration policy $\mu$.

A target history of length $T$ is denoted $h_T$. In our analysis, we extend the target policy $\pi(a_t|x_t, h_{t-1})$ into a probability distribution over $h_T$ defined by the factoring

$$\pi(x_t,a_t,r_t|h_{t-1}) = D(x_t)\,\pi(a_t|x_t,h_{t-1})\,D(r_t|x_t,a_t).$$

Similarly to $\mu$, we define shorthands $\pi_t(x,a,r)$, $P^\pi_t$, $E^\pi_t$. The goal of nonstationary policy evaluation is to estimate the expected cumulative reward of policy $\pi$ after $T$ rounds:

$$V_{1:T} = E_{h_T\sim\pi}\left[\sum_{t=1}^{T} r_t\right].$$

In the news recommendation example, $r_t$ indicates whether a user clicked on the recommended article, and $V_{1:T}$ is the expected number of clicks garnered by an online learning algorithm after serving $T$ user visits. A more effective learning algorithm, by definition, will have a higher $V_{1:T}$ value (Li et al., 2010).

Again, to have unbiased policy evaluation, we assume that if $\pi_t(a|x) > 0$ for any $t$ (and some history $h_{t-1}$) then $\mu_k(a|x) > 0$ for all $k$ (and all possible histories $z_{k-1}$). This clearly holds for instance if $\mu_k(a|x) > 0$ for all $a$.

In our analysis of nonstationary policy evaluation, we assume perfect logging, that is, we assume access to probabilities

$$p_k := \mu_k(a_k|x_k).$$

Whereas in general this assumption does not hold, it is realistic in some applications such as those on the Internet. For example, when a website chooses one news article from a pool to recommend to a user, engineers often have full control/knowledge of how to randomize the article selection process (Li et al., 2010; Li et al., 2011).

5.2 Relation to Dynamic Treatment Regimes 

The nonstationary policy evaluation problem defined above is closely related to DTR analysis in a longitudinal observational study. Using the same notation, the inference goal in DTR is to estimate the expected sum of rewards by following a possibly randomized rule $\pi$ for $T$ steps.$^6$ Unlike contextual bandits, there is no assumption on the distribution from which the data $z_n$ is generated. More precisely, given an exploration policy $\mu$, the data generation is described by

$$\mu(x_k, a_k, r_k|z_{k-1}) = D(x_k|z_{k-1})\,\mu(a_k|x_k,z_{k-1})\,D(r_k|x_k,a_k,z_{k-1}).$$

Compared to the data-generation process in contextual bandits (see Section 3.1), one allows the laws of $x_k$ and $r_k$ to depend on history $z_{k-1}$. The target policy $\pi$ is subject to the same conditional laws. The setting in longitudinal observational studies is therefore more general than contextual bandits.

$^6$ In DTR often the goal is to estimate the expectation of a composite outcome that depends on the entire length-$T$ trajectory. However, the objective of composite outcomes can easily be reformulated as a sum of properly redefined rewards.

IPS-style estimators (such as DR of the previous section) can be extended to handle nonstationary policy evaluation, where the likelihood ratios are now the ratios of likelihoods of the whole length-$T$ trajectories. In DTR analysis, it is often assumed that the number of trajectories is much larger than $T$. Under this assumption and with $T$ small, the variance of IPS-style estimates is on the order of $O(1/n)$, diminishing to 0 as $n \to \infty$.

In contextual bandits, one similarly assumes $n \gg T$. However, the number of steps $T$ is often large, ranging from hundreds to millions. The likelihood ratio for a length-$T$ trajectory can be exponential in $T$, resulting in exponentially large variance. As a concrete example, consider the case where the exploration policy (i.e., the treatment mechanism) chooses actions uniformly at random from $K$ possibilities, and where the target policy $\pi$ is a deterministic function of the current history and context. The likelihood ratio of any trajectory is exactly $K^T$, and there are $n/T$ trajectories (by breaking $z_n$ into $n/T$ pieces of length $T$). Assuming bounded variance of rewards, the variance of IPS-style estimators given data $z_n$ is $O(TK^T/n)$, which can be extremely large (or even vacuous) for even moderate values of $T$, such as those in the studies of online learning in the Internet applications.
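A quick calculation shows how fast the trajectory-level weight grows. The sketch below only evaluates the orders of magnitude quoted in the text, ignoring constants; the function names are illustrative assumptions.

```python
def trajectory_weight(K, T):
    """Likelihood ratio K^T of a length-T trajectory under uniform exploration."""
    return K ** T


def ips_variance_order(K, T, n):
    """Order of the variance of trajectory-level IPS, T * K^T / n (constants ignored)."""
    return T * K ** T / n


if __name__ == "__main__":
    n = 10 ** 6
    for T in (5, 10, 20, 40):
        print(T, trajectory_weight(3, T), ips_variance_order(3, T, n))
```

Even with only three actions and a million logged events, the quantity $TK^T/n$ is already in the tens of thousands at $T = 20$, so the bound carries no information long before $T$ reaches the hundreds.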

In contrast, the “replay” approach of Li et al. (2011) takes advantage of the independence between $(x_k, r_k)$ and history $z_{k-1}$. It has a variance of $O(KT/n)$, ignoring logarithmic terms, when the exploration policy is uniformly random. When the exploration data is generated by a nonuniformly random policy, one may apply rejection sampling to simulate uniformly random exploration, obtaining a subset of the exploration data, which can then be used to run the replay approach. However, this method may discard a large fraction of data, especially when the historical actions in the log are chosen from a highly nonuniform distribution, which can yield an unacceptably large variance. The next subsection describes an improved replay-based estimator that uses doubly-robust estimation as well as a variant of rejection sampling.
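One standard way to simulate uniform exploration from a nonuniform log is to accept each event with probability inversely proportional to its logged probability. The sketch below assumes a known lower bound `p_min` on all logged probabilities and events stored as `(x, a, r, p)`; both the function name and this layout are assumptions made for illustration, not a description of the replay implementation.

```python
import random


def subsample_to_uniform(log, K, p_min, rng):
    """Rejection-sample a log so that accepted actions look uniformly random.

    Each event (x, a, r, p) is accepted with probability p_min / p, which
    requires p >= p_min > 0 for every logged event. Conditional on
    acceptance, every action is equally likely, so the kept events are
    relabeled with probability 1/K.
    """
    kept = []
    for x, a, r, p in log:
        if rng.random() < p_min / p:
            kept.append((x, a, r, 1.0 / K))
    return kept
```

The expected acceptance rate is $K \cdot p_{\min}$, so a single rarely chosen action forces most of the log to be discarded, which is the inefficiency described above.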

5.3 A Nonstationary Policy Evaluator 

Our replay-based nonstationary policy evaluator (Algorithm 1) takes advantage of the high accuracy of the DR estimator while tackling nonstationarity via rejection sampling. We substantially improve sample use (i.e., acceptance rate) in rejection sampling while only modestly increasing the bias. This algorithm is referred to as DR-ns, for “doubly robust nonstationary.” Over the run of the algorithm, we process the exploration history and run rejection sampling [Steps (5)-(6)] to create a simulated history $h_t$ of the interaction between the target policy and the environment. If the algorithm manages to simulate $T$ steps of history, it exits and returns an estimate $\hat V_{\text{DR-ns}}$ of the cumulative reward $V_{1:T}$ and an estimate $\hat V^{\text{avg}}_{\text{DR-ns}}$ of the average reward $V_{1:T}/T$; otherwise, it reports failure indicating not enough data is available.

Algorithm 1
DR-ns($\pi$, $\{(x_k,a_k,r_k,p_k)\}_{k=1,2,\dots,n}$, $\hat r$, $q$, $c_{\max}$, $T$)

Input:
    target nonstationary policy $\pi$
    exploration data $\{(x_k,a_k,r_k,p_k)\}_{k=1,2,\dots,n}$
    reward estimator $\hat r(x,a)$
    rejection sampling parameters: $q \in [0,1]$ and $c_{\max} \in (0,1]$
    number of steps $T$ for estimation

Initialize:
    simulated history of target policy $h_0 \leftarrow \emptyset$
    simulated step of target policy $t \leftarrow 1$
    acceptance rate multiplier $c_1 \leftarrow c_{\max}$
    cumulative reward estimate $\hat V_{\text{DR-ns}} \leftarrow 0$
    cumulative normalizing weight $C \leftarrow 0$
    importance weights seen so far $Q \leftarrow \emptyset$

For $k = 1, 2, \dots$ consider event $(x_k,a_k,r_k,p_k)$:
    (1) $\hat V_k \leftarrow \hat r(x_k, \pi_t) + \frac{\pi_t(a_k|x_k)}{p_k} \cdot (r_k - \hat r(x_k,a_k))$
    (2) $\hat V_{\text{DR-ns}} \leftarrow \hat V_{\text{DR-ns}} + c_t \hat V_k$
    (3) $C \leftarrow C + c_t$
    (4) $Q \leftarrow Q \cup \{p_k/\pi_t(a_k|x_k)\}$
    (5) Let $u_k \sim \mathrm{uniform}[0,1]$
    (6) If $u_k \le c_t\pi_t(a_k|x_k)/p_k$
        (a) $h_t \leftarrow h_{t-1} + (x_k,a_k,r_k)$
        (b) $t \leftarrow t + 1$
        (c) if $t = T + 1$, go to “Exit”
        (d) $c_t \leftarrow \min\{c_{\max}, q\text{th quantile of } Q\}$

Exit: If $t < T + 1$, report failure and terminate; otherwise, return:
    cumulative reward estimate $\hat V_{\text{DR-ns}}$
    average reward estimate $\hat V^{\text{avg}}_{\text{DR-ns}} := \hat V_{\text{DR-ns}}/C$
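The following Python sketch walks through the steps of Algorithm 1. It is a hedged illustration rather than the authors' implementation, and it rests on several assumptions: the target policy is a callable returning a dictionary of action probabilities given the context and the simulated history; $\hat r(x_k, \pi_t)$ is interpreted as the expectation of $\hat r(x_k, a)$ under $\pi_t(\cdot|x_k)$; the set $Q$ is updated with the ratio $p_k/\pi_t(a_k|x_k)$ on every event (recorded as infinity when $\pi_t(a_k|x_k) = 0$); and the quantile is a simple empirical one computed by the hypothetical helper `quantile`.

```python
import random


def quantile(values, q):
    """Empirical q-th quantile of a non-empty list (lower order statistic)."""
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]


def dr_ns(policy, data, r_hat, q, c_max, T, actions, rng):
    """Sketch of DR-ns; returns (cumulative, average) estimates or None on failure.

    policy(x, history) -> dict {action: probability} for the current step
    data               -> iterable of logged events (x, a, r, p) with p > 0
    r_hat(x, a)        -> reward estimate
    """
    history = []                 # simulated history of the target policy
    t = 1                        # simulated step of the target policy
    c_t = c_max                  # acceptance rate multiplier
    v_total, c_total = 0.0, 0.0  # cumulative estimate and normalizing weight
    ratios = []                  # the set Q of observed ratios p / pi_t(a | x)
    for x, a, r, p in data:
        pi_t = policy(x, history)
        # Step (1): doubly robust estimate for this event.
        model_value = sum(pi_t[b] * r_hat(x, b) for b in actions)
        v_k = model_value + pi_t[a] / p * (r - r_hat(x, a))
        # Steps (2)-(4): accumulate with weight c_t and record the ratio.
        v_total += c_t * v_k
        c_total += c_t
        ratios.append(p / pi_t[a] if pi_t[a] > 0 else float("inf"))
        # Steps (5)-(6): rejection sampling to extend the simulated history.
        if rng.random() < c_t * pi_t[a] / p:
            history.append((x, a, r))
            t += 1
            if t == T + 1:
                return v_total, v_total / c_total
            c_t = min(c_max, quantile(ratios, q))
    return None                  # not enough data to simulate T steps
```

Note that every event contributes to the estimate in Steps (1)-(3), whether or not it is later accepted in Step (6); acceptance only decides when the simulated history, and hence the target policy's state, advances.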

Since we assume $n \gg T$, the algorithm fails with a small
probability as long as the exploration policy does not assign too
small probabilities to actions. Specifically, let $\alpha > 0$ be a
lower bound on the acceptance probability in the rejection sampling
step; that is, the condition in Step (6) succeeds with probability at
least $\alpha$. Then, using Hoeffding's inequality, one can show that
the probability of failure of the algorithm is at most $\delta$ if

$$ n \ge \frac{T + \ln(e/\delta)}{\alpha}. $$

Note that the algorithm returns one “sample” of the policy value. In
reality, the algorithm continuously consumes a stream of $n$ data,
outputs a sample of policy value whenever a length-$T$ history is
simulated, and finally returns the average of these samples. Suppose
we aim to simulate $m$ histories of length $T$. Again, by Hoeffding's
inequality, the probability of failing to obtain $m$ trajectories is
at most $\delta$ if

$$ n \ge \frac{mT + \ln(e/\delta)}{\alpha}. $$
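
As a rough planning aid, the sample-size condition can be evaluated numerically. The sketch below assumes the bound reads $n \ge (mT + \ln(e/\delta))/\alpha$; the function name `min_exploration_samples` is hypothetical.

```python
import math


def min_exploration_samples(T, alpha, delta, m=1):
    """Smallest integer n with n >= (m*T + ln(e/delta)) / alpha.

    T: trajectory length, m: number of trajectories,
    alpha: lower bound on the acceptance probability, delta: failure probability.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return math.ceil((m * T + math.log(math.e / delta)) / alpha)
```

For example, with $T = 300$, $\alpha = 0.1$ and $\delta = 0.05$, the condition asks for roughly ten times $T$ exploration samples, since the logarithmic term is small compared with $T$.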

Compared with naive rejection sampling, our approach differs in two
respects. First, we use not only the accepted samples, but also the
rejected ones to estimate the expected reward
$\mathbf{E}_{r\sim\pi_t}[r]$ with a DR estimator [see Step (1)]. As we
will see below, the value of $1/c_t$ is in expectation equal to the
total number of exploration samples used while simulating the $t$th
action of the target policy. Therefore, in Step (2), we effectively
take an average of $1/c_t$ estimates of $\mathbf{E}_{r\sim\pi_t}[r]$,
decreasing the variance of the final estimator. This is in addition
to lower variance due to the use of the doubly robust estimate in
Step (1).

The second modification is in the control of the acceptance rate
(i.e., the bound $\alpha$ above). When simulating the $t$th action of
the target policy, we accept exploration samples with a probability
$\min\{1, c_t\pi_t/p_k\}$ where $c_t$ is a multiplier [see Steps
(5)-(6)]. We will see below that the bias of the estimator is
controlled by the probability that $c_t\pi_t/p_k$ exceeds 1, or
equivalently, that $p_k/\pi_t$ falls below $c_t$. As a heuristic
toward controlling this probability, we maintain a set $Q$ consisting
of observed density ratios $p_k/\pi_t$, and at the beginning of
simulating the $t$th action, we set $c_t$ to the $q$th quantile of
$Q$, for some small value of $q$ [Step (6)(d)], while never allowing
it to exceed some predetermined $c_{\max}$. Thus, the value $q$
approximately corresponds to the probability value that we wish to
control. Setting $q = 0$, we obtain the unbiased case (in the limit).
By using larger values of $q$, we increase the bias, but reach the
length $T$ with fewer exploration samples thanks to the increased
acceptance rate. A similar effect is obtained by varying $c_{\max}$,
but the control is cruder, since it ignores the evaluated policy. In
our experiments, we therefore set $c_{\max} = 1$ and rely on $q$ to
control the acceptance rate. It is an interesting open question how
to select $q$ and $c_{\max}$ in practice.

To study our algorithm DR-ns, we modify the definition of the
exploration history so as to include the samples $u_k$ from the
uniform distribution used by the algorithm when processing the $k$th
exploration sample. Thus, we have an augmented definition

$$ z_k = (x_1, a_1, r_1, u_1, \ldots, x_k, a_k, r_k, u_k). $$

With this in mind, expressions $\mathbf{P}^\mu_k$ and
$\mathbf{E}^\mu_k$ include conditioning on variables
$u_1,\ldots,u_{k-1}$, and $\mu$ is viewed as a distribution over
augmented histories $z_n$.

For convenience of analysis, we assume in this section that we have
access to an infinite exploration history $z$ (i.e., $z_n$ for
$n = \infty$) and that the counter $t$ in the pseudocode eventually
becomes $T + 1$ with probability one (at which point $h_T$ is
generated). Such an assumption is mild in practice when $n$ is much
larger than $T$.

Formally, for $t \ge 1$, let $\kappa(t)$ be the index of the $t$th
sample accepted in Step (6); thus, $\kappa$ converts an index in the
target history into an index in the exploration history. We set
$\kappa(0) = 0$ and define $\kappa(t) = \infty$ if fewer than $t$
samples are accepted. Note that $\kappa$ is a deterministic function
of the history $z$ (thanks to including samples $u_k$ in $z$). We
assume that $\mathbf{P}^\mu[\kappa(T) = \infty] = 0$. This means that
the algorithm (together with the exploration policy $\mu$) generates
a distribution over histories $h_T$; we denote this distribution
$\hat\pi$.

Let $B(t) = \{\kappa(t-1)+1, \kappa(t-1)+2, \ldots, \kappa(t)\}$ for
$t \ge 1$ denote the set of sample indices between the $(t-1)$st
acceptance and the $t$th acceptance. This set of samples is called
the $t$th block. The contribution of the $t$th block to the value
estimator is denoted $\hat V_{B(t)} = \sum_{k\in B(t)} \hat V_k$.
After completion of $T$ blocks, the two estimators returned by our
algorithm are

$$ \hat V_{\text{DR-ns}} = \sum_{t=1}^{T} c_t \hat V_{B(t)},
\qquad
\hat V^{\text{avg}}_{\text{DR-ns}}
  = \frac{\sum_{t=1}^{T} c_t \hat V_{B(t)}}{\sum_{t=1}^{T} c_t |B(t)|}. $$
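
The block form of the two estimators can be checked with a few lines of Python. This is an illustrative sketch only; `combine_blocks` and its argument names are hypothetical, and the inputs are assumed to be per-block multipliers $c_t$, block sums $\hat V_{B(t)}$ and block sizes $|B(t)|$.

```python
def combine_blocks(c, block_values, block_sizes):
    """Return (cumulative, average) estimates from per-block quantities.

    c[t]            : acceptance-rate multiplier used during block t
    block_values[t] : sum of the DR terms V_k over block t
    block_sizes[t]  : number of exploration samples in block t
    """
    cumulative = sum(ct * vb for ct, vb in zip(c, block_values))
    weight = sum(ct * nb for ct, nb in zip(c, block_sizes))
    return cumulative, cumulative / weight
```

The denominator equals the running weight $C$ accumulated in Step (3), since $c_t$ is added once for every sample in block $t$.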


5.4 Bias Analysis 

A simple approach to evaluating a nonstationary policy is to divide
the exploration data into several parts, run the algorithm separately
on each part to generate simulated histories, obtaining estimates
$\hat V^{(1)}_{\text{DR-ns}}, \ldots, \hat V^{(m)}_{\text{DR-ns}}$,
and return the average
$\sum_{i=1}^{m} \hat V^{(i)}_{\text{DR-ns}}/m$.$^{7}$ Here, we assume
$n$ is large enough so that $m$ simulated histories of length $T$ can
be generated with high probability. Using standard concentration
inequalities, we can then show that the average is within
$O(1/\sqrt{m})$ of the expectation
$\mathbf{E}_\mu[\hat V_{\text{DR-ns}}]$. The remaining piece is then
bounding the bias term
$\mathbf{E}_\mu[\hat V_{\text{DR-ns}}] - \mathbf{E}_\pi[\sum_{t=1}^{T} r_t]$.$^{8}$

Recall that $\hat V_{\text{DR-ns}} = \sum_{t=1}^{T} c_t \hat V_{B(t)}$.
The source of bias are events when $c_t$ is not small enough to
guarantee that $c_t\pi_t(a_k\mid x_k)/p_k$ is a probability. In this
case, the probability that the $k$th exploration sample includes the
action $a_k$ and is accepted is

$$ (5.1)\qquad p_k \min\Bigl\{1, \frac{c_t\pi_t(a_k\mid x_k)}{p_k}\Bigr\}
   = \min\{p_k,\ c_t\pi_t(a_k\mid x_k)\}, $$

which may violate the unbiasedness requirement of rejection sampling,
requiring that the probability of acceptance be proportional to
$\pi_t(a_k\mid x_k)$.

Conditioned on $z_{k-1}$ and the induced target history $h_{t-1}$,
define the event

$$ \mathcal{E}_k := \{(x,a) : c_t\pi_t(a\mid x) > \mu_k(a\mid x)\}, $$

which contributes to the bias of the estimate, because it corresponds
to cases when the minimum in equation (5.1) is attained by $p_k$.
Associated with this event is the “bias mass” $\varepsilon_k$, which
measures (up to scaling by $c_t$) the difference between the
probability of the bad event under $\pi_t$ and under the run of our
algorithm:

$$ \varepsilon_k := \mathbf{P}_{(x,a)\sim\pi_t}[\mathcal{E}_k]
   - \mathbf{P}_{(x,a)\sim\mu_k}[\mathcal{E}_k]/c_t. $$

Notice that from the definition of $\mathcal{E}_k$, this mass is
nonnegative. Since the first term is a probability, this mass is at
most 1. We will assume that this mass is bounded away from 1, that
is, that there exists $\varepsilon$ such that for all $k$ and
$z_{k-1}$

$$ 0 \le \varepsilon_k \le \varepsilon < 1. $$

$^{7}$ We only consider estimators for cumulative rewards (not
average rewards) in this section. We assume that the division into
parts is done sequentially, so that individual estimates are built
from nonoverlapping sequences of $T$ consecutive blocks of examples.

$^{8}$ As shown in Li et al. (2011), when $m$ is constant, making $T$
large does not necessarily reduce variance of any estimator of
nonstationary policies.
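
For intuition, the bias mass can be computed directly when the action set is finite and a single context is fixed. The sketch below assumes the definition $\varepsilon_k = \mathbf{P}_{\pi_t}[\mathcal{E}_k] - \mathbf{P}_{\mu_k}[\mathcal{E}_k]/c_t$ with $\mathcal{E}_k = \{a : c_t\pi_t(a) > \mu_k(a)\}$; the function name `bias_mass` and the dictionary representation are illustrative choices, not part of the original analysis.

```python
def bias_mass(pi_t, mu_k, c_t):
    """Bias mass for one context.

    pi_t, mu_k: dicts mapping action -> probability under the target and
    exploration policies; c_t: acceptance-rate multiplier in (0, 1].
    """
    # Bad event: actions where c_t * pi_t exceeds the exploration probability,
    # so the acceptance probability would have to be clipped at 1.
    bad = [a for a in pi_t if c_t * pi_t[a] > mu_k.get(a, 0.0)]
    p_target = sum(pi_t[a] for a in bad)
    p_explore = sum(mu_k.get(a, 0.0) for a in bad)
    return p_target - p_explore / c_t
```

With a deterministic target policy and uniform exploration over two actions, $c_t = 1$ gives a bias mass of 0.5, whereas $c_t = 0.5$ removes the clipping and gives 0.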

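The following theorem analyzes how much bias is introduced in the
worst case, as a function of $\varepsilon$. It shows how the bias
mass controls the bias of our estimator.

Theorem 5.1. For $T \ge 1$,

$$ \Bigl|\, \mathbf{E}_\mu\Bigl[\sum_{t=1}^{T} c_t \hat V_{B(t)}\Bigr]
   - \mathbf{E}_\pi\Bigl[\sum_{t=1}^{T} r_t\Bigr] \Bigr|
   \le \frac{T(T+1)}{2}\cdot\frac{\varepsilon}{1-\varepsilon}. $$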


Intuitively, this theorem says that if a bias of $\varepsilon$ is
introduced in round $t$, its effect on the sum of rewards can be felt
for $T - t$ rounds. Summing over rounds, we expect to get an
$O(\varepsilon T^2)$ effect on the estimator of the cumulative
reward. In general a very slight bias can result in a significantly
better acceptance rate, and hence more replicates
$\hat V^{(i)}_{\text{DR-ns}}$.

This theorem is the first of this sort for policy evaluators,
although the mechanics of its proof have appeared in model-based
reinforcement learning (e.g., Kearns and Singh, 1998).
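
To get a feel for the quadratic growth in $T$, the worst-case bound can be evaluated numerically. The sketch assumes the bound is the sum over rounds of the per-round term $\varepsilon t/(1-\varepsilon)$ derived in the proof, that is $\frac{T(T+1)}{2}\cdot\frac{\varepsilon}{1-\varepsilon}$; the function name `bias_bound` is hypothetical.

```python
def bias_bound(T, eps):
    """Worst-case bias of the cumulative-reward estimate after T rounds,
    assuming the bias mass is at most eps < 1 in every round."""
    if not 0 <= eps < 1:
        raise ValueError("eps must lie in [0, 1)")
    return T * (T + 1) / 2 * eps / (1 - eps)
```

For $T = 300$, even $\varepsilon = 10^{-4}$ gives a bound of about 4.5 on a cumulative reward that lies in $[0, 300]$, which illustrates why the bound is a worst-case guarantee rather than a prediction of typical bias.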

To prove the main theorem, we state two technical lemmas bounding the
differences of probabilities and expectations under the target policy
and our algorithm (for proofs of lemmas, see Appendix F). The theorem
follows as their immediate consequence. Recall that $\hat\pi$ denotes
the distribution over target histories generated by our algorithm
(together with the exploration policy $\mu$).


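Lemma 5.2. Let $t \le T$, $k \ge 1$ and let $z_{k-1}$ be such that
the $k$th exploration sample marks the beginning of the $t$th block,
that is, $\kappa(t-1) = k - 1$. Let $h_{t-1}$ and $c_t$ be the target
history and acceptance rate multiplier induced by $z_{k-1}$. Then:

$$ \sum_{x,a} \bigl| \mathbf{P}^\mu_k[x_{\kappa(t)} = x,\ a_{\kappa(t)} = a]
   - \pi_t(x,a) \bigr| \le \frac{2\varepsilon}{1-\varepsilon}, $$

$$ \bigl| c_t\,\mathbf{E}^\mu_k[\hat V_{B(t)}]
   - \mathbf{E}_{r\sim\pi_t}[r] \bigr| \le \frac{\varepsilon}{1-\varepsilon}. $$

Lemma 5.3.

$$ \sum_{h_T} |\hat\pi(h_T) - \pi(h_T)| \le (2\varepsilon T)/(1-\varepsilon). $$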

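Proof of Theorem 5.1. First, bound
$|\mathbf{E}_\mu[c_t \hat V_{B(t)}] - \mathbf{E}_\pi[r_t]|$ using the
previous two lemmas, the triangle inequality and Hölder's inequality:

$$ \begin{aligned}
\bigl|\mathbf{E}_\mu[c_t \hat V_{B(t)}] - \mathbf{E}_\pi[r_t]\bigr|
 &= \bigl|\mathbf{E}_\mu\bigl[c_t\,\mathbf{E}^\mu_{\kappa(t-1)+1}[\hat V_{B(t)}]\bigr]
     - \mathbf{E}_\pi[r_t]\bigr| \\
 &\le \bigl|\mathbf{E}_\mu\bigl[\mathbf{E}_{r\sim\pi_t}[r]\bigr]
     - \mathbf{E}_\pi\bigl[\mathbf{E}_{r\sim\pi_t}[r]\bigr]\bigr|
     + \frac{\varepsilon}{1-\varepsilon} \\
 &= \Bigl|\mathbf{E}_{h_{t-1}\sim\hat\pi}\bigl[\mathbf{E}_{r\sim\pi_t}[r - \tfrac12]\bigr]
     - \mathbf{E}_{h_{t-1}\sim\pi}\bigl[\mathbf{E}_{r\sim\pi_t}[r - \tfrac12]\bigr]\Bigr|
     + \frac{\varepsilon}{1-\varepsilon} \\
 &\le \frac12 \sum_{h_{t-1}} |\hat\pi(h_{t-1}) - \pi(h_{t-1})|
     + \frac{\varepsilon}{1-\varepsilon} \\
 &\le \frac12\cdot\frac{2\varepsilon(t-1)}{1-\varepsilon}
     + \frac{\varepsilon}{1-\varepsilon}
  = \frac{\varepsilon t}{1-\varepsilon}.
\end{aligned} $$

The theorem now follows by summing over $t$ and using the triangle
inequality. □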

6. EXPERIMENTS: THE NONSTATIONARY 
CASE 

We now study how DR-ns may achieve greater sample efficiency than
rejection sampling through the use of a controlled bias. We evaluate
our estimator on the problem of multiclass multi-label classification
with partial feedback using the publicly available dataset rcv1
(Lewis et al., 2004). In this data, the goal is to predict whether a
news article is in one of many Reuters categories given the contents
of the article. This dataset is chosen instead of the UCI benchmarks
in Section 4 because of its bigger size, which is helpful for
simulating online learning (i.e., adaptive policies).

6.1 Data Generation 

For a multi-label dataset like rcv1, an example has the form
$(\tilde x, Y)$, where $\tilde x$ is the covariate vector and
$Y \subseteq \{1,\ldots,K\}$ is the set of correct class labels.$^{9}$

In our modeling, we assume that any $y \in Y$ is the correct
prediction for $\tilde x$. Similar to Section 4.1, an example
$(\tilde x, Y)$ may be interpreted as a bandit event with context
$\tilde x$ and loss $l(Y,a) := I(a \notin Y)$, for every action
$a \in \{1,\ldots,K\}$. A classifier can be interpreted as a
stationary policy whose expected loss is its classification error. In
this section, we again aim at evaluating expected policy loss, which
can be understood as negative reward. For our experiments, we only
use the $K = 4$ top-level classes in rcv1, namely {C, E, G, M}. We
take a random selection of 40,000 data points from the whole dataset
and call the resulting dataset $D$.

$^{9}$ The reason why we call the covariate vector $\tilde x$ rather
than $x$ becomes clear in the sequel.

To construct a partially labeled exploration dataset, we simulate a
stationary but nonuniform exploration policy with a bias toward
correct answers. This is meant to emulate the typical setting where a
baseline system already has a good understanding of which actions are
likely best. For each example $(\tilde x, Y)$, a uniformly random
value $s(a) \in [0.1, 1]$ is assigned independently to each action
$a$, and the final probability of action $a$ is determined by

$$ \mu_1(a \mid \tilde x, Y, s)
   = \frac{0.3 \times s(a)}{\sum_{a'} s(a')}
   + \frac{0.7 \times I(a \in Y)}{|Y|}. $$

Note that this policy will assign a nonzero probability to every
action. Formally, our exploration policy is a function of an extended
context $x = (\tilde x, Y, s)$, and our data generating distribution
$D(x)$ includes the generation of the correct answers $Y$ and values
$s$. Of course, we will be evaluating policies $\pi$ that only get to
see $\tilde x$, but have no access to $Y$ and $s$. Also, the
estimator $\hat l$ (recall that we are evaluating loss here, not
reward) is purely a function of $\tilde x$ and $a$. We stress that in
a real-world setting, the exploration policy would not have access to
all correct answers $Y$.
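
The exploration probabilities can be simulated with a short Python sketch. It assumes the formula $\mu_1(a) = 0.3\,s(a)/\sum_{a'} s(a') + 0.7\,I(a\in Y)/|Y|$ with a nonempty label set $Y$; the function name `exploration_probs` and the 0-based action indexing are illustrative choices.

```python
import random


def exploration_probs(K, correct, rng):
    """Action probabilities of the simulated exploration policy.

    K: number of actions (indexed 0..K-1 here, an assumption of this sketch)
    correct: nonempty set of correct labels Y
    rng: random.Random instance used to draw s(a) uniformly from [0.1, 1]
    """
    s = [rng.uniform(0.1, 1.0) for _ in range(K)]
    total = sum(s)
    return [
        0.3 * s[a] / total + 0.7 * (1.0 if a in correct else 0.0) / len(correct)
        for a in range(K)
    ]
```

Because every $s(a)$ is at least 0.1, each action receives strictly positive probability, which is what makes importance weighting against this policy well defined.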


6.2 Evaluation of a Nonstationary Policy 

As described before, a fixed (nonadaptive) classifier can be
interpreted as a stationary policy. Similarly, a classifier that
adapts as more data arrive is equivalent to a nonstationary policy.

In our experiments, we evaluate performance of an adaptive
$\epsilon$-greedy classifier defined as follows: with probability
$\epsilon = 0.1$, it predicts a label drawn uniformly at random from
$\{1, 2, \ldots, K\}$; with probability $1 - \epsilon$, it predicts
the best label according to a linear score (the “greedy” label):

$$ \operatorname*{argmax}_{a} \{ w^t_a \cdot \tilde x \}, $$

where $\{w^t_a\}_{a\in\{1,\ldots,K\}}$ is a set of $K$ weight vectors
at time $t$. This design mimics a commonly used $\epsilon$-greedy
exploration strategy for contextual bandits (e.g., Li et al., 2010).
Weight vectors $w^t_a$ are obtained by fitting a logistic regression
model for the binary classification problem $a \in Y$ (positive)
versus $a \notin Y$ (negative). The data used to fit $w^t_a$ is
described below. Thus, the greedy label is the most likely label
according to the current set of logistic regression models. The loss
estimator $\hat l(\tilde x, a)$ is also obtained by fitting a
logistic regression model for $a \in Y$ versus $a \notin Y$,
potentially on a different dataset.
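
The action distribution of such an $\epsilon$-greedy classifier can be written out explicitly, which is the form needed when computing $\pi_t(a\mid x)$ inside an off-policy estimator. The sketch below is illustrative: `epsilon_greedy_probs` is a hypothetical name, weights are plain Python lists, and ties in the linear score are broken toward the lowest action index.

```python
def epsilon_greedy_probs(weights, x, eps=0.1):
    """Probability of each action under an epsilon-greedy linear classifier.

    weights: list of K weight vectors, one per action
    x: feature vector of the same length as each weight vector
    """
    K = len(weights)
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    greedy = max(range(K), key=scores.__getitem__)
    probs = [eps / K] * K          # uniform exploration mass
    probs[greedy] += 1.0 - eps     # remaining mass on the greedy label
    return probs
```

Note that the greedy label receives probability $1 - \epsilon + \epsilon/K$, because the uniform draw can also select it.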

We partition the whole data $D$ randomly into three disjoint subsets:
$D_{\text{init}}$ (initialization set), $D_{\text{valid}}$
(validation set), and $D_{\text{eval}}$ (evaluation set), consisting
of 1%, 19%, and 80% of $D$, respectively. Our goal in this experiment
is to estimate the expected loss, $V_{1:T}$, of an adaptive policy
$\pi$ after $T = 300$ rounds.

The full-feedback set $D_{\text{init}}$ is used to fit the loss
estimator $\hat l$.

Since $D_{\text{valid}}$ is a random subset of $D$, it may be used to
simulate the behavior of policy $\pi$ to obtain an unbiased estimate
of $V_{1:T}$. We do this by taking an average of 2000 simulations of
$\pi$ on random shuffles of the set $D_{\text{valid}}$. This
estimate, denoted $\bar V_{1:T}$, is a highly accurate approximation
to (the unknown) $V_{1:T}$, and serves as our ground truth.

To assess different policy-value estimators, we randomly permute
$D_{\text{eval}}$ and transform it into a partially labeled set as
described in Section 6.1. On the resulting partially labeled data, we
then evaluate the policy $\pi$ up to round $T$, obtaining an estimate
of $V_{1:T}$. If the exploration history is not exhausted, we start
the evaluation of $\pi$ again, continuing with the next exploration
sample, but restarting from an empty target history (for $T$ rounds),
and repeat until we use up all the exploration data. The final
estimate is the average across the replicates thus obtained. We
repeat this process (permutation of $D_{\text{eval}}$, generation of
exploration history, and policy evaluation until using up all
exploration data) 50 times, so that we can compare the 50 estimates
against the ground truth $\bar V_{1:T}$ to compute bias and standard
deviation of a policy-value estimator.

Finally, we describe in more detail the $\epsilon$-greedy adaptive
classifier $\pi$ being evaluated:

• First, the policy is initialized by fitting weights $w^0_a$ on the
  full-feedback set $D_{\text{init}}$ (similarly to $\hat l$). This
  step mimics the practical situation where one usually has prior
  information (in the form of either domain knowledge or historical
  data) to initialize a policy, instead of starting from scratch.
• After this “warm-start” step, the “online” phase begins: in each
  round, the policy observes a randomly selected $\tilde x$, predicts
  a label in an $\epsilon$-greedy fashion (as described above), and
  then observes the corresponding 0/1 prediction loss. The policy is
  updated every 15 rounds. On those rounds, we retrain weights
  $w^t_a$ for each action $a$, using the full-feedback set
  $D_{\text{init}}$ as well as all the data from the online phase
  where the policy chose action $a$. The online phase terminates
  after $T = 300$ rounds.

6.3 Compared Evaluators 

We compared the following evaluators described earlier: DM for direct
method, RS for the unbiased evaluator based on rejection sampling and
“replay” (Li et al., 2011), and DR-ns as in Algorithm 1 (with
$c_{\max} = 1$). We also tested a variant of DR-ns, which does not
monitor the quantile, but instead uses $c_t$ equal to
$\min_D \mu_1(a\mid x)$; we call it DR-ns-wc since it uses the
worst-case (most conservative) value of $c_t$ that ensures
unbiasedness of rejection sampling.

6.4 Results 

Table 3 summarizes the accuracy of different evaluators in terms of
rmse (root mean squared error), bias (the absolute difference between
the average estimate and the ground truth) and stdev (standard
deviation of the estimates across different runs). It should be noted
that, given the relatively small number of trials, the measurement of
bias is not statistically significant. However, the table provides a
95% confidence interval for the rmse metric that allows a
meaningful comparison.
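
The three summary statistics can be computed from a list of repeated estimates as sketched below. The function name `summarize` is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which variant was used.

```python
import math
import statistics


def summarize(estimates, truth):
    """Return (rmse, bias, stdev) of repeated estimates against a ground truth."""
    n = len(estimates)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    bias = abs(statistics.fmean(estimates) - truth)   # absolute bias, as in the text
    stdev = statistics.stdev(estimates)               # sample standard deviation
    return rmse, bias, stdev
```

For a large number of runs, rmse squared is approximately bias squared plus stdev squared, which gives a quick consistency check on a reported row.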

It is clear that although rejection sampling is guaranteed to be
unbiased, its variance is usually the dominating part of its rmse. At
the other extreme is the direct method, which has the smallest
variance but often suffers large bias. In contrast, our method DR-ns
is able to find a good balance between the two extremes and, with
proper selection of the parameter $q$, is able to make the evaluation
results much more accurate than others.

Table 3
Nonstationary policy evaluation results

Evaluator          rmse (±95% C.I.)    bias     stdev
DM                 0.0329 ± 0.0007     0.0328   0.0027
RS                 0.0179 ± 0.0050     0.0007   0.0181
DR-ns-wc           0.0156 ± 0.0037     0.0086   0.0132
DR-ns (q = 0)      0.0129 ± 0.0034     0.0046   0.0122
DR-ns (q = 0.01)   0.0089 ± 0.0017     0.0065   0.0062
DR-ns (q = 0.05)   0.0123 ± 0.0017     0.0107   0.0061
DR-ns (q = 0.1)    0.0946 ± 0.0015     0.0946   0.0053

It is also clear that the main benefit of DR-ns is its low variance,
which stems from the adaptive choice of $c_t$ values. By slightly
violating the unbiasedness guarantee, it increases the effective data
size significantly, hence reducing the variance of its evaluation.
For $q > 0$, DR-ns was able to extract many more trajectories of
length 300 for evaluating $\pi$, while RS and DR-ns-wc were able to
find only one such trajectory out of the evaluation set. In fact, if
we increase the trajectory length of $\pi$ from 300 to 500, both RS
and DR-ns-wc are not able to construct a complete trajectory of
length 500 and fail the task completely.

7. CONCLUSIONS 

Doubly robust policy estimation is an effective technique which
virtually always improves on the widely used inverse propensity score
method. Our analysis shows that doubly robust methods tend to give
more reliable and accurate estimates, for evaluating both stationary
and nonstationary policies. The theory is corroborated by experiments
on benchmark data as well as two large-scale real-world problems. In
the future, we expect the DR technique to become common practice in
improving contextual bandit algorithms.


APPENDIX A: PROOFS OF LEMMAS 3.1-3.3 


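Throughout proofs in this appendix, we write $\hat r$ and $r^*$
instead of $\hat r(x,a)$ and $r^*(x,a)$ when $x$ and $a$ are clear
from the context, and similarly for $\Delta$ and $\varrho_k$.

Lemma 3.1. The range of $\hat V_k$ is bounded as

$$ |\hat V_k| \le 1 + M. $$

Proof.

$$ \begin{aligned}
|\hat V_k|
 &= \Bigl| \hat r(x_k,\nu)
    + \frac{\nu(a_k\mid x_k)}{\hat\mu_k(a_k\mid x_k)}\cdot\bigl(r_k - \hat r(x_k,a_k)\bigr) \Bigr| \\
 &\le |\hat r(x_k,\nu)|
    + \frac{\nu(a_k\mid x_k)}{\hat\mu_k(a_k\mid x_k)}\cdot|r_k - \hat r(x_k,a_k)| \\
 &\le 1 + M,
\end{aligned} $$

where the last inequality follows because $\hat r$ and $r_k$
are bounded in [0,1]. □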
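
As a concrete check of the range bound, the single-sample doubly robust term can be computed for a fixed context. The sketch below assumes the term has the form $\hat r(x,\nu) + \frac{\nu(a\mid x)}{\hat\mu(a\mid x)}(r - \hat r(x,a))$; the function name `dr_term` and the list-based representation are illustrative.

```python
def dr_term(r_hat_x, nu_x, mu_hat_x, a, r):
    """Doubly robust term for one logged sample in a fixed context x.

    r_hat_x[b]  : estimated reward of action b
    nu_x[b]     : target policy probability of action b
    mu_hat_x[b] : estimated exploration probability (propensity) of action b
    a, r        : logged action and observed reward
    """
    baseline = sum(p * rh for p, rh in zip(nu_x, r_hat_x))   # model-based value
    correction = nu_x[a] / mu_hat_x[a] * (r - r_hat_x[a])    # importance-weighted residual
    return baseline + correction
```

With rewards and reward estimates in [0, 1] and importance weights at most $M$, the baseline is at most 1 and the correction at most $M$ in absolute value, matching the bound $1 + M$.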

Lemma 3.2. The expectation of the term V k is 

E/[Vfc]= E [r*(x,a) + (1 — g k (x,a))A(x,a)]. 

(x,a)r^i/ 
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Proof. 

Kiv*] = , E 


(x,a,r)~/i fc . 

= E [r(x,v)\ 


\ v(a\x ) , 

r(x,v) H-—• Qk ■ [r - r) 


Hk(a\x) 


(r - f) 


= E [f(x,i/) 2 ] 

x^D 


x~D 


+ E 

x~D 


^/i fc (a|x) ^E 
aeA 


v(a\x) 


+ 2 E 

(x,a,r)~/i fc 


■D(-|x,a) L^fc(ak) 

Qk-{r- f) 


= E [r(x, u)\ 

x~D 


+ E 

x~D 


y^v(a\x) E {Qk-(r-r)\ 
W r~23(.|®,a) 


r(x, z/) 
^(a|x) 

zv(a|x) 

(x,a,r)~/ifc ./^(^l®) 

z/(a|x) 


Qk-(r~ r) 


+ E 


AfcH®) 


Qk-{r- fy 


= E [r] + E [Qk ■ (r - r)] 

(x,a)~v 

(A.l) = E [r* + (f - r*) + Qk ■ {r* - r)] 

(x,a)~v 

= E [r*+ (1-g k )A}. 

(pc,a)~v 


(A.2) 


= E [f(x,^) 2 ] 

x^D 


+ 2 E [r(x, v) ■ Qk ■ {r - r)] 

(x,a,r)~ v 


□ 


+ E 

(x ,a,r)~i' 


u(a\x) 

Qk(a\x) 


Qk ’ (l ff 


Lemma 3.3. The variance of the term 14 can be 
decomposed and bounded as follows: 

(i) V£[E fe ] 

= V [ E [r*{x,a) 

x~D a~v( • |ic) 

+ (1 - 0fc(z,a))A(x,a)]] 

- E [ E [£» fc (x,a)A(.x,a)] 2 ] 

x~D a~i/( • |ai) 


■Q k (x,a)- V [r] 
r~D( • | x,a) 


(A.3) 


= E [(f(x,i/)-Q k A)] 

(x,a)r^iy 


- E [ei& 2 ] + E, 


where E denotes the term 
v(a\x) 


E:= E 

(x,a,r)r^u 


fikifl\x) 


Qk-{r~ ff 


E 

(x,a)~i' 

u{a\x) 

fik{a\x) 

E 

(x,a)~l' 

v[a\x) 

Qk{a\x) 


Qk(x,a)A(x,aY 


(ii) V£fo] 


< V [r*(x,i/)] 

X~D 

+ 2 E [|(1 - Qk(x,a))A(x,a) 

(x,a)~is 

+ M E [Qk(x,a] 

(x,a)r^u 


E [(r - r(x,a)y]]. 

r~D( • | x,a) 


Proof. 

E£[U 2 ] = E 

(x,o,r)~/i fc . 


. u(a\x) 

r{x,u) H-—• Qk 

hk{a\x) 


To obtain an expression for the variance of 14, first 
note that by equation (A.l), 

(A.4) E %[%]= E [r(x,u)-Q k A], 

(x,a)^iy 

Combining this with equation (A.3), we obtain 
V£[l4] = V [r(x,z/) - 0 fe A] 

(oi,a)~iz 

- E [^.A 2 ] + E 

(x,a)~v 

= V[ E [f(x,v)-g k A]] 

x^D a~i'( • |ai) 

+ E [ V [f(x,i/)-Q k A]] 

a~£/( • |ai) 

- E [ V [g fc A]] 

x~-D a~^( ■ \x) 

- E [ E [^A] 2 ]+E 

x^D a~i/( • |ai) 

= V[ E [r* + (1 - Qk)A}] 

x~D a~i/( • |a;) 
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+ E [ V [gfeA]] 

x~D a~is( • I#) 

- E [ V [f?fcA]] 

x~D a~i/( • |ai) 

- E [ E [^ fc A] 2 ]+E 

x<~^D a~i/( • |#) 

= V[ E [r* + (l-e fc )A]] 

x~D • |ai) 

- E [ E [g k A] 2 } + E. 

x~D a~i/( • |a;) 


We now obtain part (i) of the lemma by decompos¬ 
ing the term E: 


E= E 


(a;,a,r)~v jltfc(d|x) 

v(a\x) 


v(a\x) 2 

Qk ■ (r — r ) 


+ E 

(x,a)~i/ 


= E 


fi k {a\x) 
u(a\x) 


Qk ■ {r ~ r ) 


- r , , x ■ Qk ■ V [rj 

(x,a)r^jy r~D(-|a;,a) 


+ E 


v(a\x) 


Qk A 2 


(x,a)~i' |*^) 

To prove part (ii) of the lemma, first note that 
r(x,v) 2 = (r*(x,is) + E [A(x,a)]) 2 

• \x) 

= r*(x, u) 2 + 2r*(x, v) E [A(x,a)] 


a~i/( • I#) 


+ E [A(x,a)]" 

a~i/( • |a;) 


= r*(x,is) 2 + 2r(x,is) E [A(x,a)] 

a~i'( • |ai) 

- E [A(rc,a)] 2 

a~z/( • I#) 

< r*(x, v) 2 + 2r(x, v) E [A(x,a)]. 

a~i/( • |#) 

Plugging this in equation (A. 2), we obtain 
KiVk]= E [r(x,i/) 2 ] 

x~D 

+ 2 E [r(x, u) ■ Q k ■ (r - f)] + E 

(:r,a,7*)~i/ 

< E [r*(x, u) 2 \ 

x^D 

+ 2 E \r{x,v) E [A]] 

x~D a~v( • |rr) 

+ 2 E [r(x,u) ■ (-Qk) ■ A] + E 

(x,a)~i' 


(A.5) = E [r*(x,v) 2 } 

x~D 

+ 2 E [f(x, v) ■ (1 - Qk) ■ A] + E. 

(x,a)^iy 

On the other hand, equation (A.4) can be rewritten 
as 


E£[T4]= E [r*(x,u) + (1 — f?fc)A]. 

(x,a)r^iy 

Combining with equation (A.5), we obtain 
V£[Vfc]< V [r*(x, u)\ 


+ 2 E [r(x,u) ■ (1-Q k )A] 

(x,a)~L/ 

-2 E [r*{x,v)\ E [(l-e fc )A] 

x^D (x,a)~L/ 

- E [(l-Q k )A] 2 + E 
(x,a)~z/ 

< V [r*(x,v)\ 

x~D 

+ 2 E [(r{x,u) - |)(1 - Q k )A] 

(x,a)~L/ 

-2 E [r*{x,v)-\] E [(1 - gfc)A] 

x^D (x,a)~u 

+ E 


< V [r*(x,i/)]+ E [|(1 — ff fc )A|] 

x^D (x,a)~v 

+ | E [(1 — f?fc)A]| + E, 

(x,a)~v 


where the last inequality follows by Holder’s inequal¬ 
ity and the observations that \r — 1/2| < 1/2 and 
| r* — 1/2|< 1/2. Part (ii) now follows by the bound 


E = 


E 

(x,a,r)r^jy 


u(a\x) 

Qk(a\x) 


• Qk-{r- 


f) 


2 


<M E [Qk E l(r-f) 2 }}. 

(x,a)~i/ r~D(-\x,a) 


□ 


APPENDIX B: FREEDMAN’S INEQUALITY 

The following is a corollary of Theorem 1 of Beygelzimer et al.
(2011). It can be viewed as a version of Freedman's inequality
(Freedman, 1975). Let $y_1,\ldots,y_n$ be a sequence of real-valued
random variables. Let $\mathbf{E}_k$ denote
$\mathbf{E}[\,\cdot \mid y_1,\ldots,y_{k-1}]$ and $\mathbf{V}_k$ the
conditional variance.

Theorem B.1. Let $V, D \in \mathbb{R}$ such that

$$ \sum_{k=1}^{n} \mathbf{V}_k[y_k] \le V, $$

and for all $k$, $|y_k - \mathbf{E}_k[y_k]| \le D$. Then for any
$\delta > 0$, with probability at least $1 - \delta$,

$$ \Bigl| \sum_{k=1}^{n} y_k - \sum_{k=1}^{n} \mathbf{E}_k[y_k] \Bigr|
   \le 2\max\bigl\{ D\ln(2/\delta),\ \sqrt{V\ln(2/\delta)} \bigr\}. $$


APPENDIX C: IMPROVED FINITE-SAMPLE
ERROR BOUND

In this appendix, we analyze the error of $\hat V_{\text{DR}}$ in
estimating the value of a stationary policy $\nu$. We generalize the
analysis of Section 3.4.3 by replacing conditions on the ranges of
variables by conditions on the moments.

For a function $f: X \times A \to \mathbb{R}$ and $1 \le p < \infty$,
we define the $L_p(\nu)$ norm as usual:

$$ \|f\|_{p,\nu} = \mathbf{E}_{(x,a)\sim\nu}\bigl[|f(x,a)|^p\bigr]^{1/p}. $$

For $p = \infty$, $\|f\|_{\infty,\nu}$ is the essential supremum of
$|f|$ under $\nu$.

As in Section 3.4.3, we first simplify Lemmas 3.1-3.3, and then apply
Freedman's inequality to obtain a specific error bound.

Lemma C.1. Let $1 \le p, q \le \infty$ be such that
$1/p + 1/q = 1$. Assume there are finite constants
$M, e_{\hat r}, \delta_\Delta, \delta_\varrho, \varrho_{\max} \ge 0$
such that with probability one under $\mu$, for all $k$:

$$ \begin{aligned}
\nu(a_k\mid x_k)/\hat\mu_k(a_k\mid x_k) &\le M, \\
\|\Delta\|_{q,\nu} &\le \delta_\Delta, \\
\|1 - \varrho_k\|_{p,\nu} &\le \delta_\varrho, \\
\|\varrho_k\|_{p,\nu} &\le \varrho_{\max}, \\
\mathbf{E}_{(x,a)\sim\nu}\Bigl[\mathbf{E}_{r\sim D(\cdot\mid x,a)}\bigl[(\hat r(x,a) - r)^2\bigr]^q\Bigr]^{1/q} &\le e_{\hat r}.
\end{aligned} $$

Then with probability one under $\mu$, for all $k$:

$$ \begin{aligned}
|\mathbf{E}^\mu_k[\hat V_k] - V| &\le \delta_\varrho\delta_\Delta, \\
\mathbf{V}^\mu_k[\hat V_k] &\le \mathbf{V}_{x\sim D}[r^*(x,\nu)]
   + 2\delta_\varrho\delta_\Delta + M\varrho_{\max}e_{\hat r}.
\end{aligned} $$

Proof. The bias and variance bound follow from Lemma 3.2 and
Lemma 3.3(ii), respectively, by Hölder's inequality. □

Theorem C.2. If assumptions of Lemma C.1 hold, then with probability
at least $1 - \delta$,

$$ |\hat V_{\text{DR}} - V| \le \delta_\varrho\delta_\Delta
   + 2\max\Bigl\{ \frac{(1+M)\ln(2/\delta)}{n},\
     \sqrt{\frac{\bigl(\mathbf{V}_{x\sim D}[r^*(x,\nu)] + 2\delta_\varrho\delta_\Delta
       + M\varrho_{\max}e_{\hat r}\bigr)\ln(2/\delta)}{n}} \Bigr\}. $$

Proof. The proof follows by Freedman's inequality (Theorem B.1 in
Appendix B), applied to random variables $\hat V_k$, whose range and
variance are bounded using Lemma 3.1 and C.1. □


APPENDIX D: DIRECT LOSS MINIMIZATION

Given cost-sensitive multiclass classification data
$\{(x, l_1, \ldots, l_K)\}$, we perform approximate gradient descent
on the policy loss (or classification error). In the experiments of
Section 4.1, policy $\nu$ is specified by $K$ weight vectors
$\theta_1,\ldots,\theta_K$. Given $x \in X$, the policy predicts as
follows: $\nu(x) = \operatorname{argmax}_{a\in\{1,\ldots,K\}} \{x\cdot\theta_a\}$.

To optimize $\theta_a$, we adapt the “toward-better” version of the
direct loss minimization method of McAllester, Hazan and Keshet
(2011) as follows: given any data point $(x, l_1, \ldots, l_K)$ and
the current weights $\theta_a$, the weights are adjusted by

$$ \theta_{a_1} \leftarrow \theta_{a_1} + \eta x, \qquad
   \theta_{a_2} \leftarrow \theta_{a_2} - \eta x, $$

where $a_1 = \operatorname{argmax}_a \{x\cdot\theta_a - \epsilon l_a\}$,
$a_2 = \operatorname{argmax}_a \{x\cdot\theta_a\}$, $\eta \in (0,1)$
is a decaying learning rate, and $\epsilon > 0$ is an input
parameter.

For computational reasons, we actually perform batch updates rather
than incremental updates. Updates continue until the weights
converge. We found that the learning rate $\eta = t^{-0.3}/2$, where
$t$ is the batch iteration, worked well across all datasets. The
parameter $\epsilon$ was fixed to 0.1 for all datasets.

Furthermore, since the policy loss is not convex in the weight
vectors, we repeat the algorithm 20 times with randomly perturbed
starting weights and then return the best run's weight according to
the learned policy's loss in the training data. We also tried using a
holdout validation set for choosing the best weights out of the 20
candidates, but did not observe benefits from doing so.
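
A single incremental “toward-better” update can be sketched as follows. This is an illustration under assumptions, not the batch procedure used in the experiments: the name `dlm_update` is hypothetical, weights are plain Python lists, and ties in the argmax are broken toward the lowest action index.

```python
def dlm_update(theta, x, losses, eta, eps=0.1):
    """One toward-better direct-loss-minimization step; returns new weights.

    theta: list of K weight vectors, x: feature vector,
    losses: list of K per-action losses, eta: learning rate, eps: loss scale.
    """
    K = len(theta)
    scores = [sum(t_i * x_i for t_i, x_i in zip(t, x)) for t in theta]
    a1 = max(range(K), key=lambda a: scores[a] - eps * losses[a])  # loss-adjusted argmax
    a2 = max(range(K), key=lambda a: scores[a])                    # current prediction
    new_theta = [list(t) for t in theta]
    if a1 != a2:  # identical actions cancel exactly, so skip the arithmetic
        for i, x_i in enumerate(x):
            new_theta[a1][i] += eta * x_i
            new_theta[a2][i] -= eta * x_i
    return new_theta


def learning_rate(t):
    """Decaying rate eta = t^(-0.3) / 2 for batch iteration t >= 1."""
    return t ** -0.3 / 2
```

The update moves weight toward the action that is nearly as highly scored but has lower loss, and away from the currently predicted action.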

APPENDIX E: FILTER TREE 

The Filter Tree (Beygelzimer, Langford and Ravikumar, 2008) is a
reduction from multiclass cost-sensitive classification to binary
classification. Its input is of the same form as for Direct Loss
Minimization, but its output is a Filter Tree: a decision tree, where
each inner node is itself implemented by some binary classifier
(called base classifier), and leaves correspond to classes of the
original multiclass problem. As base classifiers we used J48 decision
trees implemented in Weka 3.6.4 (Hall et al., 2009). Thus, there are
2-class decision trees in the nodes, with the nodes arranged as per a
Filter Tree. Training in a Filter Tree proceeds bottom-up, but the
classification in a trained Filter Tree proceeds root-to-leaf, with
the running time logarithmic in the number of classes. We did not
test the all-pairs Filter Tree, which classifies examples in the time
linear in the number of classes, similar to DLM.

APPENDIX F: PROOFS OF LEMMAS 5.2 
AND 5.3 

Lemma 5.2. Let t <T, k>l and let Zk- i be 
such that the kth exploration sample marks the be¬ 
ginning of the tth block, that is, n(t — 1) = k — 1. Let 
ht-i and ct be the target history and acceptance rate 
multiplier induced by Zk- 1 - Then: 


and the marginal probability of accepting a sample 
from p m is 

accept m (*) := accept m (x, a) 

x,a 

— Q CjEjtj — Ct(l £m)- 

In order to accept the mth exploration sample, sam¬ 
ples k through m — 1 must be rejected. The proba¬ 
bility of eventually accepting (x,a), conditioned on 
Zk -1 is therefore 

oo m—1 

Y accept m (x,a) JJ (1 - accept fc ,(*)) 

,m>k k'=k 

(F.3) = c t iT t (x,a) 



(F-l) =x,a K ( t ) =a\ -Tr t (x,a)\ 

x,a 


(F.2) \ct^ k [V m \-n\r)\<^- £ . 

Proof. We begin by showing equation (F.l). 
Consider the mth exploration sample (x,a) ~ 
and assume that this sample is in the tth block. The 
probability of accepting this sample is 

< cfK t (a\x) 

~ Hm{a\x) _ 

r / . „ Ct 7 Tt(a\x) . , „ . 

= I[(x, a) € £ m \ H - w-j —rl[(x, a) <£ £ m \, 

where /[•] is the indicator function equal to 1 when 
its argument is true and 0 otherwise. The probability 
of seeing and accepting a sample (x,a) from fjL m is 


1 x,a) 


accept m (a:, a) 

:= Vm(x, cl) ( I[(x, a) G £ m \ 


C-t^t{Ci\x) |y \ / C 1 

“I 7 i Cl) £ m \ 


= /^m(*E> Cl)I\(x, Cl) G £m\ 

+ ctir t (x,a)I[(x,a) ££m\ 
= c t 7r t (x,a) 


•K 


oo m—1 

E na—ePVW) 

_m>k k'=k 


(F.4) 


~K 


(c t TTt(x, a) - Hm{x, a )) 


m>/c 


’ I\{Xi Cc) £ £m\ 


m—1 


■ JJ(1- accept*.,(*)) 


k’=k 


To bound |P£[x K ( t ) = x,a K ( t ) = a] - n t (x,a)\ and 
prove equation (F.l), we first need to bound equa¬ 
tions (F.3) and (F.4). Note that from the definition 
of £ m , the expression inside the expectation of equa¬ 
tion (F.4) is always nonnegative. Let E\(x,a) denote 
the expression in equation (F.3) and E^ix, a) the ex¬ 
pression in equation (F.4). We bound E\(x,a) and 
i ?2 {x, a) separately, using bounds 0 < e m < s: 


Ei(x,a) = ctirt(x, a)E^ 

< c t ir t (x,a)E £ 

_ 7 r t (x,a) 
l — £ ’ 

Ei(x,a) > c t irt(x,a)E% 


oo m—1 


e n( i - acce pvw) 

m>k k'=k 
oo m—1 


_ m>k k'=k 


oo m—1 


E IK 1 -*) 


,m>k k'=k 


- (ctTT t (x, a) - Hm(x , a))I[(x , o) € £ m \ 


= Tv t (x,a), 
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E 2 (x,a) = E 1 - 1 


<K 


E (ct vr f (x, a) - // m (z, a)) 

_m>k 

• /[(x, ft) G £ 777 ,] 
m—1 

• (1 - accept*,/(*)) 
k'=k 
oo 

E ( ctir t (x , a) - a)) 

jn>k 

/[(x, Cl) G Sm ] 


m—k 


•(l-c t (l-e)) 

Now we are ready to prove equation (F.l): 

= x , a n(t) =a) -vr t (x,a)| 

x,a 

= ^ j \E 1 (x,a) -7T t (x,a) -E 2 {x,a)\ 

x,a 

< ^ |£'i(x,a) -7r t (x,a)\ + ^E 2 (x,i 

x,a x,a 

ir t (x,a)e 


< 




x,a 


+ E£ 


— £ 


E E^O’ a) - ii m (x,a)) 

.m>k x,a 


I[(x,a) <E £ m ]( 1 - cj(l -e)) 


m—k 


< 


1 — £ 
2g 

1 — £ 


+ E£ 


y^c t £ m (l-c t (l-e)) 


m—k 


jm>k 


proving equation (F.l). 

Let reach m denote the indicator of the event that 
the mth sample is in block t (i.e., samples k,k-\- 
1,... ,m — 1 are rejected). Then 

OO 

K^t)} = E Efc [Vm. reach m ] 

m=k 

oo 

= £ E^[E^[t> m reach m ]j 


m=k 


oo 

(F.5) = y^E^reach m E^[F m ]], 

m=k 

where equation (F.5) follows because the event of 
reaching the mth sample depends only on the pre¬ 
ceding samples, and hence it is a deterministic func¬ 
tion of z m -\. Plugging Lemma 3.2 in equation (F.5), 
we obtain 


*K\Vm\ 


Ct E [r] Y E£ [reach* 


7T£ 


m=k 


= c t E H EC 


r~7T£ 


oo m— 1 

En (1 — accept*./(*)) 

_m=kk'=k 


(because E r ~ 7 r t [r] is a deterministic function of 
Zk~ i). This can be bounded, similarly as before, as 


E 


Tr^'Kt L 


E [r]<c t E»[V m \< 

r^ivt 1 — c 


yielding equation (F.2). □ 

Lemma 5.3. 


y^|7r(L T ) - vr(/i T )| < (2eT)/(l - e). 

hjp 

Proof. We prove the lemma by induction and 
the triangle inequality (essentially following Kakade, 
Kearns and Langford, 2003). The lemma holds for 
T = 0 since there is only one empty history (and 
hence both tt and 7r are point distributions over h®). 
Now assume the lemma holds for T — 1. We prove 
it for T: 


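\[
\begin{aligned}
\sum_{h_T} \bigl| \hat{\pi}(h_T) - \pi(h_T) \bigr|
&= \sum_{h_{T-1}} \sum_{(x_T, a_T, r_T)}
   \bigl| \hat{\pi}(h_{T-1})\, \hat{\pi}_T(x_T, a_T, r_T) - \pi(h_{T-1})\, \pi_T(x_T, a_T, r_T) \bigr| \\
&\le \sum_{h_{T-1}} \sum_{(x_T, a_T, r_T)}
   \Bigl( \bigl| \hat{\pi}(h_{T-1})\, \hat{\pi}_T(x_T, a_T, r_T) - \hat{\pi}(h_{T-1})\, \pi_T(x_T, a_T, r_T) \bigr| \\
&\qquad\qquad\qquad\quad
   + \bigl| \hat{\pi}(h_{T-1})\, \pi_T(x_T, a_T, r_T) - \pi(h_{T-1})\, \pi_T(x_T, a_T, r_T) \bigr| \Bigr) \\
&= \mathop{\mathbf{E}}_{h_{T-1} \sim \hat{\pi}}\Bigl[ \sum_{(x_T, a_T, r_T)}
   \bigl| \hat{\pi}_T(x_T, a_T, r_T) - \pi_T(x_T, a_T, r_T) \bigr| \Bigr]
   + \sum_{h_{T-1}} \bigl| \hat{\pi}(h_{T-1}) - \pi(h_{T-1}) \bigr| \\
&\le \frac{2\varepsilon}{1-\varepsilon} + \frac{2\varepsilon (T-1)}{1-\varepsilon}
 = \frac{2\varepsilon T}{1-\varepsilon}. \qquad \square
\end{aligned}
\]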
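
The induction pattern of Lemma 5.3 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: outcomes are binary, the per-step conditional distributions (`step_probs`) are hypothetical, and the per-step L1 distance `2 * shift` plays the role of $2\varepsilon/(1-\varepsilon)$. It enumerates all length-$T$ histories and compares the L1 distance between the two history distributions with $T$ times the per-step distance.

```python
from itertools import product


def step_probs(history, shift):
    """Hypothetical conditional distribution over outcomes {0, 1} given a history.

    With shift = 0 this is the reference process; a positive shift perturbs it
    so that the per-step L1 distance between the two conditionals is 2 * shift.
    """
    p1 = 0.3 + 0.1 * (sum(history) % 3) + shift
    return (1.0 - p1, p1)


def history_prob(history, shift):
    """Probability of a full history as a product of per-step conditionals."""
    prob = 1.0
    for t, outcome in enumerate(history):
        prob *= step_probs(history[:t], shift)[outcome]
    return prob


def l1_distance(T, shift):
    """L1 distance between perturbed and reference distributions over length-T histories."""
    return sum(
        abs(history_prob(h, shift) - history_prob(h, 0.0))
        for h in product((0, 1), repeat=T)
    )


shift = 0.05
per_step = 2 * shift  # per-step L1 distance in this toy example
for T in range(7):
    # The distance over histories never exceeds T times the per-step distance.
    assert l1_distance(T, shift) <= T * per_step + 1e-12
```

For $T = 0$ the enumeration contains only the empty history, so the distance is zero, matching the base case of the induction.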

APPENDIX G: PROGRESSIVE VALIDATION POLICY

In Section 4.1.3, we showed how the stationary DR
estimator can be used not only for policy evaluation,
but also for policy optimization by transforming the
contextual bandit problem into a cost-sensitive
classification problem.

In this appendix, we show how the nonstationary
DR estimator, when applied to an online learning
algorithm, can also be used to obtain a high-performing
stationary policy. The value of this policy
concentrates around the average per-step reward
estimated for the online learning algorithm.
Thus, to the extent that the online algorithm
achieves a high reward, so does this stationary
policy. The policy is constructed using the ideas
behind the “progressive validation” error bound
(Blum, Kalai and Langford, 1999), and hence we
call it a “progressive validation policy.”

Assume that the algorithm DR-ns successfully
terminates after generating $T$ blocks. The progressive
validation policy is the randomized stationary policy
$\pi_{\mathrm{PV}}$ defined as

\[
\pi_{\mathrm{PV}}(a \mid x) := \sum_{t=1}^{T} \frac{c_t\, |B(t)|}{C}\, \pi(a \mid x, h_{t-1}).
\]

Conceptually, this policy first picks among the histories
$h_0, \dots, h_{T-1}$ with probabilities $c_1 |B(1)| / C, \dots,
c_T |B(T)| / C$, and then executes the policy $\pi$ given
the chosen history. We extend $\pi_{\mathrm{PV}}$ to a distribution
over triples

\[
\pi_{\mathrm{PV}}(x, a, r) = D(x)\, \pi_{\mathrm{PV}}(a \mid x)\, D(r \mid x, a).
\]
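
The two-stage description of the progressive validation policy can be sketched in code. This is a minimal illustration, not the authors' implementation: the function names, the `toy_policy`, and the example values of $c_t$ and $|B(t)|$ are all hypothetical, and $C$ is assumed to equal $\sum_t c_t |B(t)|$ so that the mixture weights sum to one.

```python
import random


def block_weights(c, block_sizes):
    """Mixture weights c_t * |B(t)| / C.

    Assumption: C is the sum of c_t * |B(t)| over blocks, so the weights sum to 1.
    """
    raw = [c_t * size for c_t, size in zip(c, block_sizes)]
    total = sum(raw)
    return [w / total for w in raw]


def pv_action_probs(x, actions, histories, weights, policy):
    """pi_PV(a | x) = sum_t weight_t * pi(a | x, h_{t-1})."""
    return {
        a: sum(w * policy(a, x, h) for w, h in zip(weights, histories))
        for a in actions
    }


def sample_pv_action(x, actions, histories, weights, policy, rng):
    """Two-stage sampling: pick a history, then act with pi given that history."""
    h = rng.choices(histories, weights=weights, k=1)[0]
    action_probs = [policy(a, x, h) for a in actions]
    return rng.choices(actions, weights=action_probs, k=1)[0]


def toy_policy(a, x, h):
    """Hypothetical policy: favors action 1 more strongly as the history grows."""
    p1 = min(0.9, 0.5 + 0.1 * len(h))
    return p1 if a == 1 else 1.0 - p1


actions = [0, 1]
histories = [(), ("s1",), ("s1", "s2")]  # stand-ins for h_0, h_1, h_2
weights = block_weights(c=[0.5, 0.5, 1.0], block_sizes=[4, 2, 2])
probs = pv_action_probs(None, actions, histories, weights, toy_policy)
action = sample_pv_action(None, actions, histories, weights, toy_policy,
                          random.Random(0))
```

Computing `pv_action_probs` gives the marginal action distribution directly, while `sample_pv_action` draws from the same distribution without ever forming the mixture explicitly.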

We will show that the average reward estimator
$\hat{V}_{\text{DR-ns}}$ returned by our algorithm estimates the
expected reward of $\pi_{\mathrm{PV}}$ with an error $O(1/\sqrt{N})$, where
$N$ is the number of exploration samples used to
generate $T$ blocks. Thus, assuming that the nonstationary
policy $\pi$ improves with more data, we expect
to obtain the best-performing progressive validation
policy with the most accurate value estimate by
running the algorithm DR-ns on all of the exploration
data.


The error bound in the theorem below is proved by
analyzing range and variance of $V_k$ using Lemma 3.8.
The theorem relies on the following conditions
(mirroring the assumptions of Lemma 3.8):

• There is a constant $M > 0$ such that $\pi_t(a_k \mid x_k)/p_k \le M$.

• There is a constant $e_{\hat{r}} > 0$ such that
$\mathbf{E}_{(x,a) \sim \pi_t}\bigl[ \mathbf{E}_D[(\hat{r} - r)^2 \mid x, a] \bigr] \le e_{\hat{r}}$.

• There is a constant $v_r > 0$ such that
$\mathbf{V}_{x \sim D}\bigl[ \mathbf{E}_{r,a \sim \pi_t(\cdot,\cdot \mid x)}[r] \bigr] \le v_r$.

These conditions ensure boundedness of density
ratios, squared prediction error of rewards, and
variance of a conditional expected reward, respectively.
It should be noted that, since rewards are assumed
to be in $[0,1]$, one can always choose $e_{\hat{r}}$ and $v_r$ that
are no greater than 1.


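Theorem G.1. Let $N$ be the number of exploration
samples used to generate $T$ blocks, that is,
$N = \sum_{t=1}^{T} |B(t)|$. Assume the above conditions hold
for all $k$ and $t$ (and all histories $z_{k-1}$ and $h_{t-1}$).
Then, with probability at least $1 - \delta$,

\[
\Bigl| \hat{V}_{\text{DR-ns}} - \mathop{\mathbf{E}}_{r \sim \pi_{\mathrm{PV}}}[r] \Bigr|
\le \frac{N c_{\max}}{C} \cdot 2 \max\Biggl\{
  \frac{(1 + M) \ln(2/\delta)}{N},\
  \sqrt{\frac{(v_r + M e_{\hat{r}}) \ln(2/\delta)}{N}}
\Biggr\}.
\]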

Proof. The proof follows by Freedman’s inequality
(Theorem B.1 in Appendix B), applied to
random variables $c_t V_k$, whose range and variance
can be bounded using Lemma 3.8 and the bound
$c_t \le c_{\max}$. In applying Lemma 3.8, note that $\delta_{\varrho} = 0$
and $\varrho_{\max} = 1$, because $\hat{\mu}_k = \mu_k$. □
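
To get a feel for the magnitude of the bound in Theorem G.1, it can be evaluated numerically. The sketch below assumes the bound reads $(N c_{\max}/C) \cdot 2\max\{(1+M)\ln(2/\delta)/N,\ \sqrt{(v_r + M e_{\hat{r}})\ln(2/\delta)/N}\}$; the function name and the example inputs are hypothetical and chosen only for illustration.

```python
import math


def pv_error_bound(N, C, c_max, M, v_r, e_rhat, delta):
    """Evaluate the Theorem G.1 bound, as read in the lead-in above.

    N      : number of exploration samples used to generate the blocks
    C      : normalizer from the progressive validation policy
    c_max  : upper bound on the multipliers c_t
    M      : bound on the density ratio pi_t(a_k | x_k) / p_k
    v_r    : bound on the variance of the conditional expected reward
    e_rhat : bound on the squared prediction error of rewards
    delta  : failure probability
    """
    log_term = math.log(2.0 / delta)
    range_term = (1.0 + M) * log_term / N
    variance_term = math.sqrt((v_r + M * e_rhat) * log_term / N)
    return (N * c_max / C) * 2.0 * max(range_term, variance_term)


# Hypothetical inputs: all constants set to 1 and C = N.
bound_small = pv_error_bound(N=10_000, C=10_000, c_max=1.0, M=1.0,
                             v_r=1.0, e_rhat=1.0, delta=0.05)
bound_large = pv_error_bound(N=40_000, C=40_000, c_max=1.0, M=1.0,
                             v_r=1.0, e_rhat=1.0, delta=0.05)
```

In this regime the square-root term dominates, so quadrupling $N$ (with $C$ scaled proportionally) halves the bound, in line with the $O(1/\sqrt{N})$ rate stated before the theorem.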